Search | Biblioteca Virtual em Saúde (Virtual Health Library)

1.

MUMmer4: A fast and versatile genome alignment system.

Marçais, Guillaume; Delcher, Arthur L; Phillippy, Adam M; Coston, Rachel; Salzberg, Steven L; Zimin, Aleksey.

PLoS Comput Biol ; 14(1): e1005944, 2018 Jan.

Article in English | MEDLINE | ID: mdl-29373581

ABSTRACT

The MUMmer system and the genome sequence aligner nucmer included within it are among the most widely used alignment packages in genomics. Since the last major release of MUMmer version 3 in 2004, it has been applied to many types of problems including aligning whole genome sequences, aligning reads to a reference genome, and comparing different assemblies of the same genome. Despite its broad utility, MUMmer3 has limitations that can make it difficult to use for large genomes and for the very large sequence data sets that are common today. In this paper we describe MUMmer4, a substantially improved version of MUMmer that addresses genome size constraints by changing the 32-bit suffix tree data structure at the core of MUMmer to a 48-bit suffix array, and that offers improved speed through parallel processing of input query sequences. With a theoretical limit on the input size of 141 Tbp, MUMmer4 can now work with input sequences of any biologically realistic length. We show that as a result of these enhancements, the nucmer program in MUMmer4 is easily able to handle alignments of large genomes; we illustrate this with an alignment of the human and chimpanzee genomes, which allows us to compute that the two species are 98% identical across 96% of their length. With the enhancements described here, MUMmer4 can also be used to efficiently align reads to reference genomes, although it is less sensitive and accurate than the dedicated read aligners. The nucmer aligner in MUMmer4 can now be called from scripting languages such as Perl, Python and Ruby. These improvements make MUMmer4 one of the most versatile genome alignment packages available.
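Illustrative note (not part of the indexed record): the abstract's move from a 32-bit index to a 48-bit suffix array can be made concrete with a toy sketch. The Python below is a minimal illustration, not MUMmer4's implementation; the function names are hypothetical, the suffix array is built naively, and the suffixes are materialised as strings purely for readability. The observation that 141 Tbp is close to 2^47 positions is simple arithmetic; the abstract does not say why the limit sits one bit below 2^48.

```python
import bisect


def build_suffix_array(text):
    """Return suffix start positions in lexicographic order (naive; toy inputs only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def longest_prefix_match(text, sa, query):
    """Find the longest prefix of `query` that occurs in `text`.

    Returns (start position in text, match length). The best-matching suffix
    is always adjacent to the query's insertion point in sorted suffix order,
    so only two neighbours need to be checked after a binary search.
    """
    suffixes = [text[i:] for i in sa]  # for clarity only; real aligners compare in place
    k = bisect.bisect_left(suffixes, query)
    best_pos, best_len = -1, 0
    for j in (k - 1, k):
        if 0 <= j < len(sa):
            n = common_prefix_len(suffixes[j], query)
            if n > best_len:
                best_pos, best_len = sa[j], n
    return best_pos, best_len


def addressable_positions(bits):
    """Number of distinct positions an index entry of `bits` bits can address."""
    return 2 ** bits


if __name__ == "__main__":
    reference = "GATTACAGATTTCA"
    sa = build_suffix_array(reference)
    print(longest_prefix_match(reference, sa, "ATTTG"))
    print(addressable_positions(32))          # about 4.3e9 positions
    print(addressable_positions(47) / 1e12)   # about 140.7, i.e. roughly 141 Tbp
```

A 32-bit entry addresses about 4.3 billion positions, which is comparable to a single mammalian genome, so wider entries are what make multi-genome inputs addressable.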

Subjects

Computational Biology/methods , Sequence Alignment/methods , Software , Algorithms , Animals , Arabidopsis/genetics , Genome, Human , Genome, Plant , Genomics , Humans , Models, Theoretical , Pan troglodytes , Polymorphism, Single Nucleotide , Programming Languages , Sequence Analysis, DNA , Sequence Analysis, Protein

2.

GAGE: A critical evaluation of genome assemblies and assembly algorithms.

Salzberg, Steven L; Phillippy, Adam M; Zimin, Aleksey; Puiu, Daniela; Magoc, Tanja; Koren, Sergey; Treangen, Todd J; Schatz, Michael C; Delcher, Arthur L; Roberts, Michael; Marçais, Guillaume; Pop, Mihai; Yorke, James A.

Genome Res ; 22(3): 557-67, 2012 Mar.

Article in English | MEDLINE | ID: mdl-22147368

ABSTRACT

New sequencing technology has dramatically altered the landscape of whole-genome sequencing, allowing scientists to initiate numerous projects to decode the genomes of previously unsequenced organisms. The lowest-cost technology can generate deep coverage of most species, including mammals, in just a few days. The sequence data generated by one of these projects consist of millions or billions of short DNA sequences (reads) that range from 50 to 150 nt in length. These sequences must then be assembled de novo before most genome analyses can begin. Unfortunately, genome assembly remains a very difficult problem, made more difficult by shorter reads and unreliable long-range linking information. In this study, we evaluated several of the leading de novo assembly algorithms on four different short-read data sets, all generated by Illumina sequencers. Our results describe the relative performance of the different assemblers as well as other significant differences in assembly difficulty that appear to be inherent in the genomes themselves. Three overarching conclusions are apparent: first, that data quality, rather than the assembler itself, has a dramatic effect on the quality of an assembled genome; second, that the degree of contiguity of an assembly varies enormously among different assemblers and different genomes; and third, that the correctness of an assembly also varies widely and is not well correlated with statistics on contiguity. To enable others to replicate our results, all of our data and methods are freely available, as are all assemblers used in this study.

Subjects

Algorithms , Genomics/methods , Sequence Analysis, DNA , Animals , Computational Biology/methods , Genome , Genome, Bacterial/genetics , Humans , Internet , Reproducibility of Results

3.

Hawkeye and AMOS: visualizing and assessing the quality of genome assemblies.

Schatz, Michael C; Phillippy, Adam M; Sommer, Daniel D; Delcher, Arthur L; Puiu, Daniela; Narzisi, Giuseppe; Salzberg, Steven L; Pop, Mihai.

Brief Bioinform ; 14(2): 213-24, 2013 Mar.

Article in English | MEDLINE | ID: mdl-22199379

ABSTRACT

Since its launch in 2004, the open-source AMOS project has released several innovative DNA sequence analysis applications including: Hawkeye, a visual analytics tool for inspecting the structure of genome assemblies; the Assembly Forensics and FRCurve pipelines for systematically evaluating the quality of a genome assembly; and AMOScmp, the first comparative genome assembler. These applications have been used to assemble and analyze dozens of genomes ranging in complexity from simple microbial species through mammalian genomes. Recent efforts have been focused on enhancing support for new data characteristics brought on by second- and now third-generation sequencing. This review describes the major components of AMOS in light of these challenges, with an emphasis on methods for assessing assembly quality and the visual analytics capabilities of Hawkeye. These interactive graphical aspects are essential for navigating and understanding the complexities of a genome assembly, from the overall genome structure down to individual bases. Hawkeye and AMOS are available open source at http://amos.sourceforge.net.

Subjects

Genomics/statistics & numerical data , Sequence Analysis, DNA/statistics & numerical data , Software , Animals , Computational Biology , Computer Graphics , Data Display , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans

4.

The draft genome of the transgenic tropical fruit tree papaya (Carica papaya Linnaeus).

Ming, Ray; Hou, Shaobin; Feng, Yun; Yu, Qingyi; Dionne-Laporte, Alexandre; Saw, Jimmy H; Senin, Pavel; Wang, Wei; Ly, Benjamin V; Lewis, Kanako L T; Salzberg, Steven L; Feng, Lu; Jones, Meghan R; Skelton, Rachel L; Murray, Jan E; Chen, Cuixia; Qian, Wubin; Shen, Junguo; Du, Peng; Eustice, Moriah; Tong, Eric; Tang, Haibao; Lyons, Eric; Paull, Robert E; Michael, Todd P; Wall, Kerr; Rice, Danny W; Albert, Henrik; Wang, Ming-Li; Zhu, Yun J; Schatz, Michael; Nagarajan, Niranjan; Acob, Ricelle A; Guan, Peizhu; Blas, Andrea; Wai, Ching Man; Ackerman, Christine M; Ren, Yan; Liu, Chao; Wang, Jianmei; Wang, Jianping; Na, Jong-Kuk; Shakirov, Eugene V; Haas, Brian; Thimmapuram, Jyothi; Nelson, David; Wang, Xiyin; Bowers, John E; Gschwend, Andrea R; Delcher, Arthur L.

Nature ; 452(7190): 991-6, 2008 Apr 24.

Article in English | MEDLINE | ID: mdl-18432245

ABSTRACT

Papaya, a fruit crop cultivated in tropical and subtropical regions, is known for its nutritional benefits and medicinal applications. Here we report a 3x draft genome sequence of 'SunUp' papaya, the first commercial virus-resistant transgenic fruit tree to be sequenced. The papaya genome is three times the size of the Arabidopsis genome, but contains fewer genes, including significantly fewer disease-resistance gene analogues. Comparison of the five sequenced genomes suggests a minimal angiosperm gene set of 13,311. A lack of recent genome duplication, atypical of other angiosperm genomes sequenced so far, may account for the smaller papaya gene number in most functional groups. Nonetheless, striking amplifications in gene number within particular functional groups suggest roles in the evolution of tree-like habit, deposition and remobilization of starch reserves, attraction of seed dispersal agents, and adaptation to tropical daylengths. Transgenesis at three locations is closely associated with chloroplast insertions into the nuclear genome, and with topoisomerase I recognition sites. Papaya offers numerous advantages as a system for fruit-tree functional genomics, and this draft genome sequence provides the foundation for revealing the basis of Carica's distinguishing morpho-physiological, medicinal and nutritional properties.

Subjects

Carica/genetics , Genome, Plant/genetics , Arabidopsis/genetics , Contig Mapping , Databases, Genetic , Genes, Plant/genetics , Molecular Sequence Data , Plants, Genetically Modified/genetics , Sequence Alignment , Sequence Analysis, DNA , Transcription Factors/genetics , Tropical Climate

5.

Gene prediction with Glimmer for metagenomic sequences augmented by classification and clustering.

Kelley, David R; Liu, Bo; Delcher, Arthur L; Pop, Mihai; Salzberg, Steven L.

Nucleic Acids Res ; 40(1): e9, 2012 Jan.

Article in English | MEDLINE | ID: mdl-22102569

ABSTRACT

Environmental shotgun sequencing (or metagenomics) is widely used to survey the communities of microbial organisms that live in many diverse ecosystems, such as the human body. Finding the protein-coding genes within the sequences is an important step for assessing the functional capacity of a metagenome. In this work, we developed a metagenomics gene prediction system Glimmer-MG that achieves significantly greater accuracy than previous systems via novel approaches to a number of important prediction subtasks. First, we introduce the use of phylogenetic classifications of the sequences to model parameterization. We also cluster the sequences, grouping together those that likely originated from the same organism. Analogous to iterative schemes that are useful for whole genomes, we retrain our models within each cluster on the initial gene predictions before making final predictions. Finally, we model both insertion/deletion and substitution sequencing errors using a different approach than previous software, allowing Glimmer-MG to change coding frame or pass through stop codons by predicting an error. In a comparison among multiple gene finding methods, Glimmer-MG makes the most sensitive and precise predictions on simulated and real metagenomes for all read lengths and error rates tested.

Subjects

Metagenomics/methods , Sequence Analysis, DNA , Software , Cluster Analysis , Gastrointestinal Tract/microbiology , Genes , Humans , Metagenome , Phylogeny

6.

Assembly of large genomes using second-generation sequencing.

Schatz, Michael C; Delcher, Arthur L; Salzberg, Steven L.

Genome Res ; 20(9): 1165-73, 2010 Sep.

Article in English | MEDLINE | ID: mdl-20508146

ABSTRACT

Second-generation sequencing technology can now be used to sequence an entire human genome in a matter of days and at low cost. Sequence read lengths, initially very short, have rapidly increased since the technology first appeared, and we now are seeing a growing number of efforts to sequence large genomes de novo from these short reads. In this Perspective, we describe the issues associated with short-read assembly, the different types of data produced by second-gen sequencers, and the latest assembly algorithms designed for these data. We also review the genomes that have been assembled recently from short reads and make recommendations for sequencing strategies that will yield a high-quality assembly.

Subjects

Genomics/methods , Sequence Analysis, DNA/methods , Algorithms , Base Sequence , Genome, Human , Humans

7.

Two new complete genome sequences offer insight into host and tissue specificity of plant pathogenic Xanthomonas spp.

Bogdanove, Adam J; Koebnik, Ralf; Lu, Hong; Furutani, Ayako; Angiuoli, Samuel V; Patil, Prabhu B; Van Sluys, Marie-Anne; Ryan, Robert P; Meyer, Damien F; Han, Sang-Wook; Aparna, Gudlur; Rajaram, Misha; Delcher, Arthur L; Phillippy, Adam M; Puiu, Daniela; Schatz, Michael C; Shumway, Martin; Sommer, Daniel D; Trapnell, Cole; Benahmed, Faiza; Dimitrov, George; Madupu, Ramana; Radune, Diana; Sullivan, Steven; Jha, Gopaljee; Ishihara, Hiromichi; Lee, Sang-Won; Pandey, Alok; Sharma, Vikas; Sriariyanun, Malinee; Szurek, Boris; Vera-Cruz, Casiana M; Dorman, Karin S; Ronald, Pamela C; Verdier, Valérie; Dow, J Maxwell; Sonti, Ramesh V; Tsuge, Seiji; Brendel, Volker P; Rabinowicz, Pablo D; Leach, Jan E; White, Frank F; Salzberg, Steven L.

J Bacteriol ; 193(19): 5450-64, 2011 Oct.

Article in English | MEDLINE | ID: mdl-21784931

ABSTRACT

Xanthomonas is a large genus of bacteria that collectively cause disease on more than 300 plant species. The broad host range of the genus contrasts with stringent host and tissue specificity for individual species and pathovars. Whole-genome sequences of Xanthomonas campestris pv. raphani strain 756C and X. oryzae pv. oryzicola strain BLS256, pathogens that infect the mesophyll tissue of the leading models for plant biology, Arabidopsis thaliana and rice, respectively, were determined and provided insight into the genetic determinants of host and tissue specificity. Comparisons were made with genomes of closely related strains that infect the vascular tissue of the same hosts and across a larger collection of complete Xanthomonas genomes. The results suggest a model in which complex sets of adaptations at the level of gene content account for host specificity and subtler adaptations at the level of amino acid or noncoding regulatory nucleotide sequence determine tissue specificity.

Subjects

Genome, Bacterial/genetics , Xanthomonas/genetics , Arabidopsis/microbiology , Molecular Sequence Data , Oryza/microbiology , Xanthomonas/physiology

8.

Sim4cc: a cross-species spliced alignment program.

Zhou, Leming; Pertea, Mihaela; Delcher, Arthur L; Florea, Liliana.

Nucleic Acids Res ; 37(11): e80, 2009 Jun.

Article in English | MEDLINE | ID: mdl-19429899

ABSTRACT

Advances in sequencing technologies have accelerated the sequencing of new genomes, far outpacing the generation of gene and protein resources needed to annotate them. Direct comparison and alignment of existing cDNA sequences from a related species is an effective and readily available means to determine genes in the new genomes. Current spliced alignment programs are inadequate for comparing sequences between different species, owing to their low sensitivity and splice junction accuracy. A new spliced alignment tool, sim4cc, overcomes problems in the earlier tools by incorporating three new features: universal spaced seeds, to increase sensitivity and allow comparisons between species at various evolutionary distances, and powerful splice signal models and evolutionarily-aware alignment techniques, to improve the accuracy of gene models. When tested on vertebrate comparisons at diverse evolutionary distances, sim4cc had significantly higher sensitivity compared to existing alignment programs, more than 10% higher than the closest competitor for some comparisons, while being comparable in speed to its predecessor, sim4. Sim4cc can be used in one-to-one or one-to-many comparisons of genomic and cDNA sequences, and can also be effectively incorporated into a high-throughput annotation engine, as demonstrated by the mapping of 64,000 Fagus grandifolia 454 ESTs and unigenes to the poplar genome.

Subjects

Genomics/methods , RNA Splicing , Sequence Alignment/methods , Software , Algorithms , Animals , Dogs , Genome, Plant , Humans , Mice , Reference Standards , Sequence Alignment/standards , Vertebrates/genetics

9.

Genome sequence of the dioxin-mineralizing bacterium Sphingomonas wittichii RW1.

Miller, Todd R; Delcher, Arthur L; Salzberg, Steven L; Saunders, Elizabeth; Detter, John C; Halden, Rolf U.

J Bacteriol ; 192(22): 6101-2, 2010 Nov.

Article in English | MEDLINE | ID: mdl-20833805

ABSTRACT

Pollutants such as polychlorinated biphenyls and dioxins pose a serious threat to human and environmental health. Natural attenuation of these compounds by microorganisms provides one promising avenue for their removal from contaminated areas. Over the past 2 decades, studies of the bacterium Sphingomonas wittichii RW1 have provided a wealth of knowledge about how bacteria metabolize chlorinated aromatic hydrocarbons. Here we describe the finished genome sequence of S. wittichii RW1 and major findings from its annotation.

Subjects

DNA, Bacterial/chemistry , DNA, Bacterial/genetics , Dioxins/metabolism , Genome, Bacterial , Sphingomonas/genetics , Sphingomonas/metabolism , Environmental Pollutants/metabolism , Molecular Sequence Data , Sequence Analysis, DNA

10.

Aggressive assembly of pyrosequencing reads with mates.

Miller, Jason R; Delcher, Arthur L; Koren, Sergey; Venter, Eli; Walenz, Brian P; Brownley, Anushka; Johnson, Justin; Li, Kelvin; Mobarry, Clark; Sutton, Granger.

Bioinformatics ; 24(24): 2818-24, 2008 Dec 15.

Article in English | MEDLINE | ID: mdl-18952627

ABSTRACT

MOTIVATION: DNA sequence reads from Sanger and pyrosequencing platforms differ in cost, accuracy, typical coverage, average read length and the variety of available paired-end protocols. Both read types can complement one another in a 'hybrid' approach to whole-genome shotgun sequencing projects, but assembly software must be modified to accommodate their different characteristics. This is true even of pyrosequencing mated and unmated read combinations. Without special modifications, assemblers tuned for homogeneous sequence data may perform poorly on hybrid data. RESULTS: Celera Assembler was modified for combinations of ABI 3730 and 454 FLX reads. The revised pipeline called CABOG (Celera Assembler with the Best Overlap Graph) is robust to homopolymer run length uncertainty, high read coverage and heterogeneous read lengths. In tests on four genomes, it generated the longest contigs among all assemblers tested. It exploited the mate constraints provided by paired-end reads from either platform to build larger contigs and scaffolds, which were validated by comparison to a finished reference sequence. A low rate of contig mis-assembly was detected in some CABOG assemblies, but this was reduced in the presence of sufficient mate pair data. AVAILABILITY: The software is freely available as open-source from http://wgs-assembler.sf.net under the GNU Public License.

Subjects

Sequence Analysis, DNA/methods , Software , Computational Biology/methods , Genome , Genomics

11.

Macronuclear genome sequence of the ciliate Tetrahymena thermophila, a model eukaryote.

Eisen, Jonathan A; Coyne, Robert S; Wu, Martin; Wu, Dongying; Thiagarajan, Mathangi; Wortman, Jennifer R; Badger, Jonathan H; Ren, Qinghu; Amedeo, Paolo; Jones, Kristie M; Tallon, Luke J; Delcher, Arthur L; Salzberg, Steven L; Silva, Joana C; Haas, Brian J; Majoros, William H; Farzad, Maryam; Carlton, Jane M; Smith, Roger K; Garg, Jyoti; Pearlman, Ronald E; Karrer, Kathleen M; Sun, Lei; Manning, Gerard; Elde, Nels C; Turkewitz, Aaron P; Asai, David J; Wilkes, David E; Wang, Yufeng; Cai, Hong; Collins, Kathleen; Stewart, B Andrew; Lee, Suzanne R; Wilamowska, Katarzyna; Weinberg, Zasha; Ruzzo, Walter L; Wloga, Dorota; Gaertig, Jacek; Frankel, Joseph; Tsao, Che-Chia; Gorovsky, Martin A; Keeling, Patrick J; Waller, Ross F; Patron, Nicola J; Cherry, J Michael; Stover, Nicholas A; Krieger, Cynthia J; del Toro, Christina; Ryder, Hilary F; Williamson, Sondra C.

PLoS Biol ; 4(9): e286, 2006 Sep.

Article in English | MEDLINE | ID: mdl-16933976

ABSTRACT

The ciliate Tetrahymena thermophila is a model organism for molecular and cellular biology. Like other ciliates, this species has separate germline and soma functions that are embodied by distinct nuclei within a single cell. The germline-like micronucleus (MIC) has its genome held in reserve for sexual reproduction. The soma-like macronucleus (MAC), which possesses a genome processed from that of the MIC, is the center of gene expression and does not directly contribute DNA to sexual progeny. We report here the shotgun sequencing, assembly, and analysis of the MAC genome of T. thermophila, which is approximately 104 Mb in length and composed of approximately 225 chromosomes. Overall, the gene set is robust, with more than 27,000 predicted protein-coding genes, 15,000 of which have strong matches to genes in other organisms. The functional diversity encoded by these genes is substantial and reflects the complexity of processes required for a free-living, predatory, single-celled organism. This is highlighted by the abundance of lineage-specific duplications of genes with predicted roles in sensing and responding to environmental conditions (e.g., kinases), using diverse resources (e.g., proteases and transporters), and generating structural complexity (e.g., kinesins and dyneins). In contrast to the other lineages of alveolates (apicomplexans and dinoflagellates), no compelling evidence could be found for plastid-derived genes in the genome. UGA, the only T. thermophila stop codon, is used in some genes to encode selenocysteine, thus making this organism the first known with the potential to translate all 64 codons in nuclear genes into amino acids. We present genomic evidence supporting the hypothesis that the excision of DNA from the MIC to generate the MAC specifically targets foreign DNA as a form of genome self-defense. The combination of the genome sequence, the functional diversity encoded therein, and the presence of some pathways missing from other model organisms makes T. thermophila an ideal model for functional genomic studies to address biological, biomedical, and biotechnological questions of fundamental importance.

Subjects

Genome, Protozoan , Macronucleus/genetics , Models, Biological , Tetrahymena thermophila/genetics , Animals , Cells, Cultured , Chromosome Mapping/methods , Chromosomes , Databases, Genetic , Eukaryotic Cells/physiology , Evolution, Molecular , Micronucleus, Germline/genetics , Models, Animal , Phylogeny , Signal Transduction

12.

Genome sequence and rapid evolution of the rice pathogen Xanthomonas oryzae pv. oryzae PXO99A.

Salzberg, Steven L; Sommer, Daniel D; Schatz, Michael C; Phillippy, Adam M; Rabinowicz, Pablo D; Tsuge, Seiji; Furutani, Ayako; Ochiai, Hirokazu; Delcher, Arthur L; Kelley, David; Madupu, Ramana; Puiu, Daniela; Radune, Diana; Shumway, Martin; Trapnell, Cole; Aparna, Gudlur; Jha, Gopaljee; Pandey, Alok; Patil, Prabhu B; Ishihara, Hiromichi; Meyer, Damien F; Szurek, Boris; Verdier, Valerie; Koebnik, Ralf; Dow, J Maxwell; Ryan, Robert P; Hirata, Hisae; Tsuyumu, Shinji; Won Lee, Sang; Seo, Young-Su; Sriariyanum, Malinee; Ronald, Pamela C; Sonti, Ramesh V; Van Sluys, Marie-Anne; Leach, Jan E; White, Frank F; Bogdanove, Adam J.

BMC Genomics ; 9: 204, 2008 May 01.

Article in English | MEDLINE | ID: mdl-18452608

ABSTRACT

BACKGROUND: Xanthomonas oryzae pv. oryzae causes bacterial blight of rice (Oryza sativa L.), a major disease that constrains production of this staple crop in many parts of the world. We report here on the complete genome sequence of strain PXO99A and its comparison to two previously sequenced strains, KACC10331 and MAFF311018, which are highly similar to one another. RESULTS: The PXO99A genome is a single circular chromosome of 5,240,075 bp, considerably longer than the genomes of the other strains (4,941,439 bp and 4,940,217 bp, respectively), and it contains 5083 protein-coding genes, including 87 not found in KACC10331 or MAFF311018. PXO99A contains a greater number of virulence-associated transcription activator-like effector genes and has at least ten major chromosomal rearrangements relative to KACC10331 and MAFF311018. PXO99A contains numerous copies of diverse insertion sequence elements, members of which are associated with 7 out of 10 of the major rearrangements. A rapidly-evolving CRISPR (clustered regularly interspersed short palindromic repeats) region contains evidence of dozens of phage infections unique to the PXO99A lineage. PXO99A also contains a unique, near-perfect tandem repeat of 212 kilobases close to the replication terminus. CONCLUSION: Our results provide striking evidence of genome plasticity and rapid evolution within Xanthomonas oryzae pv. oryzae. The comparisons point to sources of genomic variation and candidates for strain-specific adaptations of this pathogen that help to explain the extraordinary diversity of Xanthomonas oryzae pv. oryzae genotypes and races that have been isolated from around the world.

Subjects

Evolution, Molecular , Genome, Bacterial/genetics , Oryza/microbiology , Xanthomonas/genetics , Bacterial Proteins/genetics , Base Sequence , DNA Transposable Elements/genetics , Gene Duplication , Gene Rearrangement , Gene Transfer, Horizontal , Genomics , Microsatellite Repeats , Reproducibility of Results , Time Factors

13.

Identifying bacterial genes and endosymbiont DNA with Glimmer.

Delcher, Arthur L; Bratke, Kirsten A; Powers, Edwin C; Salzberg, Steven L.

Bioinformatics ; 23(6): 673-9, 2007 Mar 15.

Article in English | MEDLINE | ID: mdl-17237039

ABSTRACT

MOTIVATION: The Glimmer gene-finding software has been successfully used for finding genes in bacteria, archaea and viruses representing hundreds of species. We describe several major changes to the Glimmer system, including improved methods for identifying both coding regions and start codons. We also describe a new module of Glimmer that can distinguish host and endosymbiont DNA. This module was developed in response to the discovery that eukaryotic genome sequencing projects sometimes inadvertently capture the DNA of intracellular bacteria living in the host. RESULTS: The new methods dramatically reduce the rate of false-positive predictions, while maintaining Glimmer's 99% sensitivity rate at detecting genes in most species, and they find substantially more correct start sites, as measured by comparisons to known and well-curated genes. We show that our interpolated Markov model (IMM) DNA discriminator correctly separated 99% of the sequences in a recent genome project that produced a mixture of sequences from the bacterium Prochloron didemni and its sea squirt host, Lissoclinum patella. AVAILABILITY: Glimmer is OSI Certified Open Source and available at http://cbcb.umd.edu/software/glimmer.

Subjects

Chromosome Mapping/methods , DNA, Bacterial/genetics , Decapodiformes/genetics , Genome, Bacterial/genetics , Prochloron/genetics , Software , Symbiosis/genetics , Algorithms , Animals , Artifacts , Decapodiformes/microbiology , Reproducibility of Results , Sensitivity and Specificity , Sequence Analysis, DNA/methods

14.

Minimus: a fast, lightweight genome assembler.

Sommer, Daniel D; Delcher, Arthur L; Salzberg, Steven L; Pop, Mihai.

BMC Bioinformatics ; 8: 64, 2007 Feb 26.

Article in English | MEDLINE | ID: mdl-17324286

ABSTRACT

BACKGROUND: Genome assemblers have grown very large and complex in response to the need for algorithms to handle the challenges of large whole-genome sequencing projects. Many of the most common uses of assemblers, however, are best served by a simpler type of assembler that requires fewer software components, uses less memory, and is far easier to install and run. RESULTS: We have developed the Minimus assembler to address these issues, and tested it on a range of assembly problems. We show that Minimus performs well on several small assembly tasks, including the assembly of viral genomes, individual genes, and BAC clones. In addition, we evaluate Minimus' performance in assembling bacterial genomes in order to assess its suitability as a component of a larger assembly pipeline. We show that, unlike other software currently used for these tasks, Minimus produces significantly fewer assembly errors, at the cost of generating a more fragmented assembly. CONCLUSION: We find that for small genomes and other small assembly tasks, Minimus is faster and far more flexible than existing tools. Due to its small size and modular design Minimus is perfectly suited to be a component of complex assembly pipelines. Minimus is released as an open-source software project and the code is available as part of the AMOS project at Sourceforge.

Subjects

Algorithms , Chromosome Mapping/methods , DNA/chemistry , DNA/genetics , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Software , Base Sequence , Molecular Sequence Data , Software Design , User-Computer Interface

15.

High-throughput sequence alignment using Graphics Processing Units.

Schatz, Michael C; Trapnell, Cole; Delcher, Arthur L; Varshney, Amitabh.

BMC Bioinformatics ; 8: 474, 2007 Dec 10.

Article in English | MEDLINE | ID: mdl-18070356

ABSTRACT

BACKGROUND: The recent availability of new, less expensive high-throughput DNA sequencing technologies has yielded a dramatic increase in the volume of sequence data that must be analyzed. These data are being generated for several purposes, including genotyping, genome resequencing, metagenomics, and de novo genome assembly projects. Sequence alignment programs such as MUMmer have proven essential for analysis of these data, but researchers will need ever faster, high-throughput alignment tools running on inexpensive hardware to keep up with new sequence technologies. RESULTS: This paper describes MUMmerGPU, an open-source high-throughput parallel pairwise local sequence alignment program that runs on commodity Graphics Processing Units (GPUs) in common workstations. MUMmerGPU uses the new Compute Unified Device Architecture (CUDA) from nVidia to align multiple query sequences against a single reference sequence stored as a suffix tree. By processing the queries in parallel on the highly parallel graphics card, MUMmerGPU achieves more than a 10-fold speedup over a serial CPU version of the sequence alignment kernel, and outperforms the exact alignment component of MUMmer on a high end CPU by 3.5-fold in total application time when aligning reads from recent sequencing projects using Solexa/Illumina, 454, and Sanger sequencing technologies. CONCLUSION: MUMmerGPU is a low cost, ultra-fast sequence alignment program designed to handle the increasing volume of data produced by new, high-throughput sequencing technologies. MUMmerGPU demonstrates that even memory-intensive applications can run significantly faster on the relatively low-cost GPU than on the CPU.

Subjects

Computer Graphics/instrumentation , Database Management Systems , Sequence Alignment/economics , Sequence Alignment/instrumentation , Animals , Bacillus anthracis/genetics , Base Sequence , Caenorhabditis/genetics , Computer Graphics/economics , Computers/economics , Contig Mapping/economics , Contig Mapping/instrumentation , DNA/ultrastructure , Databases, Genetic , Genomic Library , Listeria monocytogenes/genetics , Sequence Alignment/methods , Sequence Analysis, DNA/economics , Sequence Analysis, DNA/instrumentation , Sequence Analysis, DNA/methods , Streptococcus suis/genetics , Time Factors , Work Simplification

16.

New var reconstruction algorithm exposes high var sequence diversity in a single geographic location in Mali.

Dara, Antoine; Drábek, Elliott F; Travassos, Mark A; Moser, Kara A; Delcher, Arthur L; Su, Qi; Hostelley, Timothy; Coulibaly, Drissa; Daou, Modibo; Dembele, Ahmadou; Diarra, Issa; Kone, Abdoulaye K; Kouriba, Bourema; Laurens, Matthew B; Niangaly, Amadou; Traore, Karim; Tolo, Youssouf; Fraser, Claire M; Thera, Mahamadou A; Djimde, Abdoulaye A; Doumbo, Ogobara K; Plowe, Christopher V; Silva, Joana C.

Genome Med ; 9(1): 30, 2017 Mar 28.

Article in English | MEDLINE | ID: mdl-28351419

ABSTRACT

BACKGROUND: Encoded by the var gene family, highly variable Plasmodium falciparum erythrocyte membrane protein-1 (PfEMP1) proteins mediate tissue-specific cytoadherence of infected erythrocytes, resulting in immune evasion and severe malaria disease. Sequencing and assembling the 40-60 var gene complement for individual infections has been notoriously difficult, impeding molecular epidemiological studies and the assessment of particular var elements as subunit vaccine candidates. METHODS: We developed and validated a novel algorithm, Exon-Targeted Hybrid Assembly (ETHA), to perform targeted assembly of var gene sequences, based on a combination of Pacific Biosciences and Illumina data. RESULTS: Using ETHA, we characterized the repertoire of var genes in 12 samples from uncomplicated malaria infections in children from a single Malian village and showed them to be as genetically diverse as vars from isolates from around the globe. The gene var2csa, a member of the var family associated with placental malaria pathogenesis, was present in each genome, as were vars previously associated with severe malaria. CONCLUSION: ETHA, a tool to discover novel var sequences from clinical samples, will aid the understanding of malaria pathogenesis and inform the design of malaria vaccines based on PfEMP1. ETHA is available at: https://sourceforge.net/projects/etha/ .

Subjects

Algorithms , Genetic Variation , Plasmodium falciparum/metabolism , Protozoan Proteins/genetics , Sequence Analysis, DNA/methods , Child , Humans , Malaria, Falciparum/genetics , Malaria, Falciparum/metabolism , Mali , Plasmodium falciparum/genetics , Software

17.

PROBEmer: A web-based software tool for selecting optimal DNA oligos.

Emrich, Scott J; Lowe, Mary; Delcher, Arthur L.

Nucleic Acids Res ; 31(13): 3746-50, 2003 Jul 01.

Article in English | MEDLINE | ID: mdl-12824409

ABSTRACT

PROBEmer (http://probemer.cs.loyola.edu) is a web-based software tool that enables a researcher to select optimal oligos for PCR applications and multiplex detection platforms including oligonucleotide microarrays and bead-based arrays. Given two groups of nucleic-acid sequences, a target group and a non-target group, the software identifies oligo sequences that occur in members of the target group, but not in the non-target group. To help predict potential cross hybridization, PROBEmer computes all near neighbors in the non-target group and displays their alignments. The software has been used to obtain genus-specific prokaryotic probes based on the 16S rRNA gene, gene-specific probes for expression analyses and PCR primers. In this paper, we describe how to use PROBEmer, the computational methods it employs, and experimental results for oligos identified by this software tool.
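The selection step described in this abstract can be illustrated with a minimal Python sketch. It is not PROBEmer's implementation: it assumes fixed-length oligos (k-mers), requires a candidate to occur in every target sequence (the abstract says only "in members of the target group"), and uses plain Hamming distance as a stand-in for the near-neighbor search. The function names `candidate_oligos` and `near_neighbors` are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_oligos(targets, non_targets, k):
    """k-mers present in every target sequence and in no non-target sequence.

    Assumption: 'every target' is one reading of the abstract; a real tool
    may accept oligos covering only part of the target group.
    """
    if not targets:
        return set()
    shared = set.intersection(*(kmers(t, k) for t in targets))
    excluded = set().union(*(kmers(n, k) for n in non_targets))
    return shared - excluded


def near_neighbors(oligo, non_targets, max_mismatches):
    """List (sequence index, position, mismatches) for non-target windows
    within max_mismatches substitutions of the oligo (Hamming distance)."""
    oligo = oligo.upper()
    k = len(oligo)
    hits = []
    for idx, seq in enumerate(non_targets):
        seq = seq.upper()
        for pos in range(len(seq) - k + 1):
            window = seq[pos:pos + k]
            mismatches = sum(a != b for a, b in zip(oligo, window))
            if mismatches <= max_mismatches:
                hits.append((idx, pos, mismatches))
    return hits
```

For example, `candidate_oligos(["ACGTAC", "TACGTA"], ["CGTA"], 4)` keeps only the 4-mer shared by both targets and absent from the non-target, and `near_neighbors` can then flag non-target sites that differ from a candidate by a few bases and so might cross-hybridize.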

Subjects

DNA Probes/chemistry , Molecular Probe Techniques , Oligonucleotide Probes/chemistry , Sequence Analysis, DNA/methods , Software , Algorithms , Bacteria/genetics , Computational Biology , DNA Primers/chemistry , Databases, Genetic , Gene Expression Profiling/methods , Genome, Fungal , Internet , Oligonucleotide Array Sequence Analysis/methods , Open Reading Frames , Polymerase Chain Reaction/methods , RNA, Ribosomal, 16S/genetics , Saccharomyces cerevisiae/genetics , User-Computer Interface

18.

Fast algorithms for large-scale genome alignment and comparison.

Delcher, Arthur L; Phillippy, Adam; Carlton, Jane; Salzberg, Steven L.

Nucleic Acids Res ; 30(11): 2478-83, 2002 Jun 01.

Article in English | MEDLINE | ID: mdl-12034836

ABSTRACT

We describe a suffix-tree algorithm that can align the entire genome sequences of eukaryotic and prokaryotic organisms with minimal use of computer time and memory. The new system, MUMmer 2, runs three times faster while using one-third as much memory as the original MUMmer system. It has been used successfully to align the entire human and mouse genomes to each other, and to align numerous smaller eukaryotic and prokaryotic genomes. A new module permits the alignment of multiple DNA sequence fragments, which has proven valuable in the comparison of incomplete genome sequences. We also describe a method to align more distantly related genomes by detecting protein sequence homology. This extension to MUMmer aligns two genomes after translating the sequence in all six reading frames, extracts all matching protein sequences and then clusters together matches. This method has been applied to both incomplete and complete genome sequences in order to detect regions of conserved synteny, in which multiple proteins from one organism are found in the same order and orientation in another. The system code is being made freely available by the authors.
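The six-frame translation mentioned in this abstract can be sketched in a few lines of Python. This is only an illustration of the translation step under the standard genetic code, not MUMmer's code; the matching and clustering stages are not shown, and the names `six_frame_translation` and `translate` are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from the start of seq.

    Codons containing anything other than A, C, G, T become 'X';
    stop codons become '*'.
    """
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Map frame number (+1..+3 forward, -1..-3 reverse) to protein string."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(seq[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames
```

Calling `six_frame_translation` on each genome gives six protein strings per sequence; a protein-level aligner would then search for matches between the translated frames of the two genomes.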

Subjects

Algorithms , Computational Biology/methods , Genome , Genomics/methods , Sequence Alignment/methods , Animals , Computers , Contig Mapping , Genome, Bacterial , Genome, Human , Genome, Protozoan , Humans , Plasmodium/genetics , Software , Time Factors

19.

Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies.

Haas, Brian J; Delcher, Arthur L; Mount, Stephen M; Wortman, Jennifer R; Smith, Roger K; Hannick, Linda I; Maiti, Rama; Ronning, Catherine M; Rusch, Douglas B; Town, Christopher D; Salzberg, Steven L; White, Owen.

Nucleic Acids Res ; 31(19): 5654-66, 2003 Oct 01.

Article in English | MEDLINE | ID: mdl-14500829

ABSTRACT

The spliced alignment of expressed sequence data to genomic sequence has proven a key tool in the comprehensive annotation of genes in eukaryotic genomes. A novel algorithm was developed to assemble clusters of overlapping transcript alignments (ESTs and full-length cDNAs) into maximal alignment assemblies, thereby comprehensively incorporating all available transcript data and capturing subtle splicing variations. Complete and partial gene structures identified by this method were used to improve The Institute for Genomic Research Arabidopsis genome annotation (TIGR release v.4.0). The alignment assemblies permitted the automated modeling of several novel genes and >1000 alternative splicing variations as well as updates (including UTR annotations) to nearly half of the approximately 27 000 annotated protein coding genes. The algorithm of the Program to Assemble Spliced Alignments (PASA) tool is described, as well as the results of automated updates to Arabidopsis gene annotations.

Subjects

Arabidopsis/genetics , Genome, Plant , RNA, Plant/analysis , Sequence Alignment/methods , Software , Algorithms , Alternative Splicing , Arabidopsis/metabolism , DNA, Complementary/analysis , Expressed Sequence Tags , Introns , Plant Proteins/genetics , RNA, Plant/chemistry , Transcription, Genetic , Untranslated Regions

20.

Efficient decoding algorithms for generalized hidden Markov model gene finders.

Majoros, William H; Pertea, Mihaela; Delcher, Arthur L; Salzberg, Steven L.

BMC Bioinformatics ; 6: 16, 2005 Jan 24.

Article in English | MEDLINE | ID: mdl-15667658

ABSTRACT

BACKGROUND: The Generalized Hidden Markov Model (GHMM) has proven a useful framework for the task of computational gene prediction in eukaryotic genomes, due to its flexibility and probabilistic underpinnings. As the focus of the gene finding community shifts toward the use of homology information to improve prediction accuracy, extensions to the basic GHMM model are being explored as possible ways to integrate this homology information into the prediction process. Particularly prominent among these extensions are those techniques which call for the simultaneous prediction of genes in two or more genomes at once, thereby increasing significantly the computational cost of prediction and highlighting the importance of speed and memory efficiency in the implementation of the underlying GHMM algorithms. Unfortunately, the task of implementing an efficient GHMM-based gene finder is already a nontrivial one, and it can be expected that this task will only grow more onerous as our models increase in complexity. RESULTS: As a first step toward addressing the implementation challenges of these next-generation systems, we describe in detail two software architectures for GHMM-based gene finders, one comprising the common array-based approach, and the other a highly optimized algorithm which requires significantly less memory while achieving virtually identical speed. We then show how both of these architectures can be accelerated by a factor of two by optimizing their content sensors. We finish with a brief illustration of the impact these optimizations have had on the feasibility of our new homology-based gene finder, TWAIN. CONCLUSIONS: In describing a number of optimizations for GHMM-based gene finders and making available two complete open-source software systems embodying these methods, it is our hope that others will be more enabled to explore promising extensions to the GHMM framework, thereby improving the state-of-the-art in gene prediction techniques.

Subjects

Computational Biology/methods , Gene Expression Regulation , Algorithms , Bayes Theorem , Computer Simulation , DNA/chemistry , Databases, Genetic , Gene Expression Profiling , Genome , Genome, Human , Humans , Likelihood Functions , Markov Chains , Models, Biological , Models, Genetic , Models, Statistical , Probability , Programming Languages , Sequence Alignment , Sequence Analysis, DNA , Software
