1.
PLoS Comput Biol ; 14(1): e1005944, 2018 Jan.
Article in English | MEDLINE | ID: mdl-29373581

ABSTRACT

The MUMmer system and the genome sequence aligner nucmer included within it are among the most widely used alignment packages in genomics. Since the last major release of MUMmer version 3 in 2004, it has been applied to many types of problems including aligning whole genome sequences, aligning reads to a reference genome, and comparing different assemblies of the same genome. Despite its broad utility, MUMmer3 has limitations that can make it difficult to use for large genomes and for the very large sequence data sets that are common today. In this paper we describe MUMmer4, a substantially improved version of MUMmer that addresses genome size constraints by changing the 32-bit suffix tree data structure at the core of MUMmer to a 48-bit suffix array, and that offers improved speed through parallel processing of input query sequences. With a theoretical limit on the input size of 141 Tbp, MUMmer4 can now work with input sequences of any biologically realistic length. We show that as a result of these enhancements, the nucmer program in MUMmer4 is easily able to handle alignments of large genomes; we illustrate this with an alignment of the human and chimpanzee genomes, which allows us to compute that the two species are 98% identical across 96% of their length. With the enhancements described here, MUMmer4 can also be used to efficiently align reads to reference genomes, although it is less sensitive and accurate than the dedicated read aligners. The nucmer aligner in MUMmer4 can now be called from scripting languages such as Perl, Python and Ruby. These improvements make MUMmer4 one of the most versatile genome alignment packages available.


Subject(s)
Computational Biology/methods , Sequence Alignment/methods , Software , Algorithms , Animals , Arabidopsis/genetics , Human Genome , Plant Genome , Genomics , Humans , Theoretical Models , Pan troglodytes , Single Nucleotide Polymorphism , Programming Languages , DNA Sequence Analysis , Protein Sequence Analysis
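
As an illustration of the suffix-array idea mentioned in the abstract above, the following is a minimal sketch of how a sorted list of suffix positions can be searched for exact matches. It is not MUMmer4's implementation: the function names `build_suffix_array` and `find_occurrences` are hypothetical, the construction is the naive sort-based one, and no attempt is made to reproduce the 48-bit index layout the abstract describes.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of `text`, in lexicographic order.

    Naive construction by sorting suffixes directly; fine for short strings,
    far too slow for genomes (assumption: illustration only).
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted start positions of exact matches of `pattern` in `text`.

    Suffixes that start with `pattern` form one contiguous run in the suffix
    array, so two binary searches locate the run's boundaries.
    """
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


reference = "GATTACAGATTA"
sa = build_suffix_array(reference)
print(find_occurrences(reference, sa, "ATTA"))
```

Because the array stores only integer positions, the width of each stored integer bounds the longest sequence that can be indexed, which is the kind of constraint the move from a 32-bit to a 48-bit index addresses.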
2.
Genome Med ; 9(1): 30, 2017 Mar 28.
Article in English | MEDLINE | ID: mdl-28351419

ABSTRACT

BACKGROUND: Encoded by the var gene family, highly variable Plasmodium falciparum erythrocyte membrane protein-1 (PfEMP1) proteins mediate tissue-specific cytoadherence of infected erythrocytes, resulting in immune evasion and severe malaria disease. Sequencing and assembling the 40-60 var gene complement for individual infections has been notoriously difficult, impeding molecular epidemiological studies and the assessment of particular var elements as subunit vaccine candidates. METHODS: We developed and validated a novel algorithm, Exon-Targeted Hybrid Assembly (ETHA), to perform targeted assembly of var gene sequences, based on a combination of Pacific Biosciences and Illumina data. RESULTS: Using ETHA, we characterized the repertoire of var genes in 12 samples from uncomplicated malaria infections in children from a single Malian village and showed them to be as genetically diverse as vars from isolates from around the globe. The gene var2csa, a member of the var family associated with placental malaria pathogenesis, was present in each genome, as were vars previously associated with severe malaria. CONCLUSION: ETHA, a tool to discover novel var sequences from clinical samples, will aid the understanding of malaria pathogenesis and inform the design of malaria vaccines based on PfEMP1. ETHA is available at: https://sourceforge.net/projects/etha/ .


Subject(s)
Algorithms , Genetic Variation , Plasmodium falciparum/metabolism , Protozoan Proteins/genetics , DNA Sequence Analysis/methods , Child , Humans , Falciparum Malaria/genetics , Falciparum Malaria/metabolism , Mali , Plasmodium falciparum/genetics , Software
3.
Brief Bioinform ; 14(2): 213-24, 2013 Mar.
Article in English | MEDLINE | ID: mdl-22199379

ABSTRACT

Since its launch in 2004, the open-source AMOS project has released several innovative DNA sequence analysis applications including: Hawkeye, a visual analytics tool for inspecting the structure of genome assemblies; the Assembly Forensics and FRCurve pipelines for systematically evaluating the quality of a genome assembly; and AMOScmp, the first comparative genome assembler. These applications have been used to assemble and analyze dozens of genomes ranging in complexity from simple microbial species through mammalian genomes. Recent efforts have been focused on enhancing support for new data characteristics brought on by second- and now third-generation sequencing. This review describes the major components of AMOS in light of these challenges, with an emphasis on methods for assessing assembly quality and the visual analytics capabilities of Hawkeye. These interactive graphical aspects are essential for navigating and understanding the complexities of a genome assembly, from the overall genome structure down to individual bases. Hawkeye and AMOS are available open source at http://amos.sourceforge.net.


Subject(s)
Genomics/statistics & numerical data , DNA Sequence Analysis/statistics & numerical data , Software , Animals , Computational Biology , Computer Graphics , Data Display , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans
4.
Genome Res ; 22(3): 557-67, 2012 Mar.
Article in English | MEDLINE | ID: mdl-22147368

ABSTRACT

New sequencing technology has dramatically altered the landscape of whole-genome sequencing, allowing scientists to initiate numerous projects to decode the genomes of previously unsequenced organisms. The lowest-cost technology can generate deep coverage of most species, including mammals, in just a few days. The sequence data generated by one of these projects consist of millions or billions of short DNA sequences (reads) that range from 50 to 150 nt in length. These sequences must then be assembled de novo before most genome analyses can begin. Unfortunately, genome assembly remains a very difficult problem, made more difficult by shorter reads and unreliable long-range linking information. In this study, we evaluated several of the leading de novo assembly algorithms on four different short-read data sets, all generated by Illumina sequencers. Our results describe the relative performance of the different assemblers as well as other significant differences in assembly difficulty that appear to be inherent in the genomes themselves. Three overarching conclusions are apparent: first, that data quality, rather than the assembler itself, has a dramatic effect on the quality of an assembled genome; second, that the degree of contiguity of an assembly varies enormously among different assemblers and different genomes; and third, that the correctness of an assembly also varies widely and is not well correlated with statistics on contiguity. To enable others to replicate our results, all of our data and methods are freely available, as are all assemblers used in this study.


Subject(s)
Algorithms , Genomics/methods , DNA Sequence Analysis , Animals , Computational Biology/methods , Genome , Bacterial Genome/genetics , Humans , Internet , Reproducibility of Results
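
The abstract above refers to "statistics on contiguity" without naming one. A commonly used contiguity statistic is N50, and the sketch below shows how it can be computed from a list of contig lengths. Treating N50 as representative of the statistics used in that study is an assumption, and the function name `n50` is hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    N50 is the length L such that contigs of length >= L together contain
    at least half of the total assembled bases. Lengths are assumed to be
    positive integers.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    # Only reached when the input list is empty.
    raise ValueError("contig_lengths must not be empty")


print(n50([2, 3, 4, 5, 6]))
```

N50 says nothing about whether the contigs are assembled correctly, which is consistent with the abstract's point that correctness is not well correlated with contiguity.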
5.
Nucleic Acids Res ; 40(1): e9, 2012 Jan.
Article in English | MEDLINE | ID: mdl-22102569

ABSTRACT

Environmental shotgun sequencing (or metagenomics) is widely used to survey the communities of microbial organisms that live in many diverse ecosystems, such as the human body. Finding the protein-coding genes within the sequences is an important step for assessing the functional capacity of a metagenome. In this work, we developed a metagenomics gene prediction system Glimmer-MG that achieves significantly greater accuracy than previous systems via novel approaches to a number of important prediction subtasks. First, we introduce the use of phylogenetic classifications of the sequences to model parameterization. We also cluster the sequences, grouping together those that likely originated from the same organism. Analogous to iterative schemes that are useful for whole genomes, we retrain our models within each cluster on the initial gene predictions before making final predictions. Finally, we model both insertion/deletion and substitution sequencing errors using a different approach than previous software, allowing Glimmer-MG to change coding frame or pass through stop codons by predicting an error. In a comparison among multiple gene finding methods, Glimmer-MG makes the most sensitive and precise predictions on simulated and real metagenomes for all read lengths and error rates tested.


Subject(s)
Metagenomics/methods , DNA Sequence Analysis , Software , Cluster Analysis , Gastrointestinal Tract/microbiology , Genes , Humans , Metagenome , Phylogeny
6.
J Bacteriol ; 193(19): 5450-64, 2011 Oct.
Article in English | MEDLINE | ID: mdl-21784931

ABSTRACT

Xanthomonas is a large genus of bacteria that collectively cause disease on more than 300 plant species. The broad host range of the genus contrasts with stringent host and tissue specificity for individual species and pathovars. Whole-genome sequences of Xanthomonas campestris pv. raphani strain 756C and X. oryzae pv. oryzicola strain BLS256, pathogens that infect the mesophyll tissue of the leading models for plant biology, Arabidopsis thaliana and rice, respectively, were determined and provided insight into the genetic determinants of host and tissue specificity. Comparisons were made with genomes of closely related strains that infect the vascular tissue of the same hosts and across a larger collection of complete Xanthomonas genomes. The results suggest a model in which complex sets of adaptations at the level of gene content account for host specificity and subtler adaptations at the level of amino acid or noncoding regulatory nucleotide sequence determine tissue specificity.


Subject(s)
Bacterial Genome/genetics , Xanthomonas/genetics , Arabidopsis/microbiology , Molecular Sequence Data , Oryza/microbiology , Xanthomonas/physiology
7.
PLoS One ; 6(3): e14792, 2011 Mar 31.
Article in English | MEDLINE | ID: mdl-21483493

ABSTRACT

Comparative genomic sequencing is shedding new light on bacterial identification, taxonomy and phylogeny. An in silico assessment of a core gene set necessary for cellular functioning was made to determine a consensus set of genes that would be useful for the identification, taxonomy and phylogeny of the species belonging to the subclass Actinobacteridae, which contained two orders, Actinomycetales and Bifidobacteriales. The subclass Actinobacteridae comprised about 85% of the actinobacteria families. The following recommended criteria were used to establish a comprehensive gene set: the gene should (i) be long enough to contain phylogenetically useful information, (ii) not be subject to horizontal gene transfer, (iii) be a single copy, (iv) have at least two regions sufficiently conserved to allow the design of amplification and sequencing primers and (v) predict whole-genome relationships. We applied these constraints to 50 different Actinobacteridae genomes and made 1,224 pairwise comparisons of the genome conserved regions and gene fragments obtained by using the Sequence VARiability Analysis Program (SVARAP), which allows primers to be designed. Following a comparative statistical modeling phase, 3 gene fragments were selected, ychF, rpoB, and secY, with R² > 0.85. Selected sets of broad range primers were tested from the 3 gene fragments and were demonstrated to be useful for amplification and sequencing of 25 species belonging to 9 genera of Actinobacteridae. The intraspecies similarities were 96.3-100% for ychF, 97.8-100% for rpoB and 96.9-100% for secY among 73 strains belonging to 15 species of the subclass Actinobacteridae, compared with 99.4-100% for 16S rRNA. The phylogenetic topology obtained from the combined datasets ychF+rpoB+secY was globally similar to that inferred from the 16S rRNA but with higher confidence. It was concluded that multi-locus sequence analysis using a core gene set might represent the first consensus and valid approach for investigating bacterial identification, phylogeny and taxonomy.


Subject(s)
Actinobacteria/genetics , Bacterial Proteins/genetics , Multilocus Sequence Typing/methods , Actinobacteria/classification , Bacterial Proteins/classification , Phylogeny
8.
Nat Genet ; 43(2): 109-16, 2011 Feb.
Article in English | MEDLINE | ID: mdl-21186353

ABSTRACT

The woodland strawberry, Fragaria vesca (2n = 2x = 14), is a versatile experimental plant system. This diminutive herbaceous perennial has a small genome (240 Mb), is amenable to genetic transformation and shares substantial sequence identity with the cultivated strawberry (Fragaria × ananassa) and other economically important rosaceous plants. Here we report the draft F. vesca genome, which was sequenced to ×39 coverage using second-generation technology, assembled de novo and then anchored to the genetic linkage map into seven pseudochromosomes. This diploid strawberry sequence lacks the large genome duplications seen in other rosids. Gene prediction modeling identified 34,809 genes, with most being supported by transcriptome mapping. Genes critical to valuable horticultural traits including flavor, nutritional value and flowering time were identified. Macrosyntenic relationships between Fragaria and Prunus predict a hypothetical ancestral Rosaceae genome that had nine chromosomes. New phylogenetic analysis of 154 protein-coding genes suggests that assignment of Populus to Malvidae, rather than Fabidae, is warranted.


Subject(s)
Fragaria/genetics , Plant Genome , Algorithms , Chloroplasts/genetics , Chromosome Mapping , Gene Expression Profiling , Plant Genes , Genetic Linkage , Fluorescence In Situ Hybridization , Likelihood Functions , Genetic Models , Phylogeny , Terminal Repeat Sequences , Genetic Transcription
9.
J Bacteriol ; 192(22): 6101-2, 2010 Nov.
Article in English | MEDLINE | ID: mdl-20833805

ABSTRACT

Pollutants such as polychlorinated biphenyls and dioxins pose a serious threat to human and environmental health. Natural attenuation of these compounds by microorganisms provides one promising avenue for their removal from contaminated areas. Over the past 2 decades, studies of the bacterium Sphingomonas wittichii RW1 have provided a wealth of knowledge about how bacteria metabolize chlorinated aromatic hydrocarbons. Here we describe the finished genome sequence of S. wittichii RW1 and major findings from its annotation.


Subject(s)
Bacterial DNA/chemistry , Bacterial DNA/genetics , Dioxins/metabolism , Bacterial Genome , Sphingomonas/genetics , Sphingomonas/metabolism , Environmental Pollutants/metabolism , Molecular Sequence Data , DNA Sequence Analysis
10.
Genome Res ; 20(9): 1165-73, 2010 Sep.
Article in English | MEDLINE | ID: mdl-20508146

ABSTRACT

Second-generation sequencing technology can now be used to sequence an entire human genome in a matter of days and at low cost. Sequence read lengths, initially very short, have rapidly increased since the technology first appeared, and we now are seeing a growing number of efforts to sequence large genomes de novo from these short reads. In this Perspective, we describe the issues associated with short-read assembly, the different types of data produced by second-gen sequencers, and the latest assembly algorithms designed for these data. We also review the genomes that have been assembled recently from short reads and make recommendations for sequencing strategies that will yield a high-quality assembly.


Subject(s)
Genomics/methods , DNA Sequence Analysis/methods , Algorithms , Base Sequence , Human Genome , Humans
11.
Nucleic Acids Res ; 37(11): e80, 2009 Jun.
Article in English | MEDLINE | ID: mdl-19429899

ABSTRACT

Advances in sequencing technologies have accelerated the sequencing of new genomes, far outpacing the generation of gene and protein resources needed to annotate them. Direct comparison and alignment of existing cDNA sequences from a related species is an effective and readily available means to determine genes in the new genomes. Current spliced alignment programs are inadequate for comparing sequences between different species, owing to their low sensitivity and splice junction accuracy. A new spliced alignment tool, sim4cc, overcomes problems in the earlier tools by incorporating three new features: universal spaced seeds, to increase sensitivity and allow comparisons between species at various evolutionary distances, and powerful splice signal models and evolutionarily-aware alignment techniques, to improve the accuracy of gene models. When tested on vertebrate comparisons at diverse evolutionary distances, sim4cc had significantly higher sensitivity compared to existing alignment programs, more than 10% higher than the closest competitor for some comparisons, while being comparable in speed to its predecessor, sim4. Sim4cc can be used in one-to-one or one-to-many comparisons of genomic and cDNA sequences, and can also be effectively incorporated into a high-throughput annotation engine, as demonstrated by the mapping of 64,000 Fagus grandifolia 454 ESTs and unigenes to the poplar genome.


Subject(s)
Genomics/methods , RNA Splicing , Sequence Alignment/methods , Software , Algorithms , Animals , Dogs , Plant Genome , Humans , Mice , Reference Standards , Sequence Alignment/standards , Vertebrates/genetics
12.
Genome Biol ; 10(4): R42, 2009.
Article in English | MEDLINE | ID: mdl-19393038

ABSTRACT

BACKGROUND: The genome of the domestic cow, Bos taurus, was sequenced using a mixture of hierarchical and whole-genome shotgun sequencing methods. RESULTS: We have assembled the 35 million sequence reads and applied a variety of assembly improvement techniques, creating an assembly of 2.86 billion base pairs that has multiple improvements over previous assemblies: it is more complete, covering more of the genome; thousands of gaps have been closed; many erroneous inversions, deletions, and translocations have been corrected; and thousands of single-nucleotide errors have been corrected. Our evaluation using independent metrics demonstrates that the resulting assembly is substantially more accurate and complete than alternative versions. CONCLUSIONS: By using independent mapping data and conserved synteny between the cow and human genomes, we were able to construct an assembly with excellent large-scale contiguity in which a large majority (approximately 91%) of the genome has been placed onto the 30 B. taurus chromosomes. We constructed a new cow-human synteny map that expands upon previous maps. We also identified for the first time a portion of the B. taurus Y chromosome.


Subject(s)
Cattle/genetics , Genome/genetics , DNA Sequence Analysis/methods , Animals , Chromosome Mapping , Female , Human Genome/genetics , Genomics , Humans , Male , DNA Sequence Analysis/statistics & numerical data , Synteny , Y Chromosome/genetics
13.
Bioinformatics ; 24(24): 2818-24, 2008 Dec 15.
Article in English | MEDLINE | ID: mdl-18952627

ABSTRACT

MOTIVATION: DNA sequence reads from Sanger and pyrosequencing platforms differ in cost, accuracy, typical coverage, average read length and the variety of available paired-end protocols. Both read types can complement one another in a 'hybrid' approach to whole-genome shotgun sequencing projects, but assembly software must be modified to accommodate their different characteristics. This is true even of pyrosequencing mated and unmated read combinations. Without special modifications, assemblers tuned for homogeneous sequence data may perform poorly on hybrid data. RESULTS: Celera Assembler was modified for combinations of ABI 3730 and 454 FLX reads. The revised pipeline called CABOG (Celera Assembler with the Best Overlap Graph) is robust to homopolymer run length uncertainty, high read coverage and heterogeneous read lengths. In tests on four genomes, it generated the longest contigs among all assemblers tested. It exploited the mate constraints provided by paired-end reads from either platform to build larger contigs and scaffolds, which were validated by comparison to a finished reference sequence. A low rate of contig mis-assembly was detected in some CABOG assemblies, but this was reduced in the presence of sufficient mate pair data. AVAILABILITY: The software is freely available as open-source from http://wgs-assembler.sf.net under the GNU Public License.


Subject(s)
DNA Sequence Analysis/methods , Software , Computational Biology/methods , Genome , Genomics
14.
BMC Genomics ; 9: 204, 2008 May 01.
Article in English | MEDLINE | ID: mdl-18452608

ABSTRACT

BACKGROUND: Xanthomonas oryzae pv. oryzae causes bacterial blight of rice (Oryza sativa L.), a major disease that constrains production of this staple crop in many parts of the world. We report here on the complete genome sequence of strain PXO99A and its comparison to two previously sequenced strains, KACC10331 and MAFF311018, which are highly similar to one another. RESULTS: The PXO99A genome is a single circular chromosome of 5,240,075 bp, considerably longer than the genomes of the other strains (4,941,439 bp and 4,940,217 bp, respectively), and it contains 5083 protein-coding genes, including 87 not found in KACC10331 or MAFF311018. PXO99A contains a greater number of virulence-associated transcription activator-like effector genes and has at least ten major chromosomal rearrangements relative to KACC10331 and MAFF311018. PXO99A contains numerous copies of diverse insertion sequence elements, members of which are associated with 7 out of 10 of the major rearrangements. A rapidly-evolving CRISPR (clustered regularly interspersed short palindromic repeats) region contains evidence of dozens of phage infections unique to the PXO99A lineage. PXO99A also contains a unique, near-perfect tandem repeat of 212 kilobases close to the replication terminus. CONCLUSION: Our results provide striking evidence of genome plasticity and rapid evolution within Xanthomonas oryzae pv. oryzae. The comparisons point to sources of genomic variation and candidates for strain-specific adaptations of this pathogen that help to explain the extraordinary diversity of Xanthomonas oryzae pv. oryzae genotypes and races that have been isolated from around the world.


Subject(s)
Molecular Evolution , Bacterial Genome/genetics , Oryza/microbiology , Xanthomonas/genetics , Bacterial Proteins/genetics , Base Sequence , DNA Transposable Elements/genetics , Gene Duplication , Gene Rearrangement , Horizontal Gene Transfer , Genomics , Microsatellite Repeats , Reproducibility of Results , Time Factors
15.
Nature ; 452(7190): 991-6, 2008 Apr 24.
Article in English | MEDLINE | ID: mdl-18432245

ABSTRACT

Papaya, a fruit crop cultivated in tropical and subtropical regions, is known for its nutritional benefits and medicinal applications. Here we report a 3x draft genome sequence of 'SunUp' papaya, the first commercial virus-resistant transgenic fruit tree to be sequenced. The papaya genome is three times the size of the Arabidopsis genome, but contains fewer genes, including significantly fewer disease-resistance gene analogues. Comparison of the five sequenced genomes suggests a minimal angiosperm gene set of 13,311. A lack of recent genome duplication, atypical of other angiosperm genomes sequenced so far, may account for the smaller papaya gene number in most functional groups. Nonetheless, striking amplifications in gene number within particular functional groups suggest roles in the evolution of tree-like habit, deposition and remobilization of starch reserves, attraction of seed dispersal agents, and adaptation to tropical daylengths. Transgenesis at three locations is closely associated with chloroplast insertions into the nuclear genome, and with topoisomerase I recognition sites. Papaya offers numerous advantages as a system for fruit-tree functional genomics, and this draft genome sequence provides the foundation for revealing the basis of Carica's distinguishing morpho-physiological, medicinal and nutritional properties.


Subject(s)
Carica/genetics , Plant Genome/genetics , Arabidopsis/genetics , Contig Mapping , Genetic Databases , Plant Genes/genetics , Molecular Sequence Data , Genetically Modified Plants/genetics , Sequence Alignment , DNA Sequence Analysis , Transcription Factors/genetics , Tropical Climate
16.
BMC Bioinformatics ; 8: 474, 2007 Dec 10.
Article in English | MEDLINE | ID: mdl-18070356

ABSTRACT

BACKGROUND: The recent availability of new, less expensive high-throughput DNA sequencing technologies has yielded a dramatic increase in the volume of sequence data that must be analyzed. These data are being generated for several purposes, including genotyping, genome resequencing, metagenomics, and de novo genome assembly projects. Sequence alignment programs such as MUMmer have proven essential for analysis of these data, but researchers will need ever faster, high-throughput alignment tools running on inexpensive hardware to keep up with new sequence technologies. RESULTS: This paper describes MUMmerGPU, an open-source high-throughput parallel pairwise local sequence alignment program that runs on commodity Graphics Processing Units (GPUs) in common workstations. MUMmerGPU uses the new Compute Unified Device Architecture (CUDA) from nVidia to align multiple query sequences against a single reference sequence stored as a suffix tree. By processing the queries in parallel on the highly parallel graphics card, MUMmerGPU achieves more than a 10-fold speedup over a serial CPU version of the sequence alignment kernel, and outperforms the exact alignment component of MUMmer on a high end CPU by 3.5-fold in total application time when aligning reads from recent sequencing projects using Solexa/Illumina, 454, and Sanger sequencing technologies. CONCLUSION: MUMmerGPU is a low cost, ultra-fast sequence alignment program designed to handle the increasing volume of data produced by new, high-throughput sequencing technologies. MUMmerGPU demonstrates that even memory-intensive applications can run significantly faster on the relatively low-cost GPU than on the CPU.


Subject(s)
Computer Graphics/instrumentation , Database Management Systems , Sequence Alignment/economics , Sequence Alignment/instrumentation , Animals , Bacillus anthracis/genetics , Base Sequence , Caenorhabditis/genetics , Computer Graphics/economics , Computers/economics , Contig Mapping/economics , Contig Mapping/instrumentation , DNA/ultrastructure , Genetic Databases , Genomic Library , Listeria monocytogenes/genetics , Sequence Alignment/methods , DNA Sequence Analysis/economics , DNA Sequence Analysis/instrumentation , DNA Sequence Analysis/methods , Streptococcus suis/genetics , Time Factors , Work Simplification
17.
Science ; 317(5845): 1756-60, 2007 Sep 21.
Article in English | MEDLINE | ID: mdl-17885136

ABSTRACT

Parasitic nematodes that cause elephantiasis and river blindness threaten hundreds of millions of people in the developing world. We have sequenced the approximately 90 megabase (Mb) genome of the human filarial parasite Brugia malayi and predict approximately 11,500 protein coding genes in 71 Mb of robustly assembled sequence. Comparative analysis with the free-living, model nematode Caenorhabditis elegans revealed that, despite these genes having maintained little conservation of local synteny during approximately 350 million years of evolution, they largely remain in linkage on chromosomal units. More than 100 conserved operons were identified. Analysis of the predicted proteome provides evidence for adaptations of B. malayi to niches in its human and vector hosts and insights into the molecular basis of a mutualistic relationship with its Wolbachia endosymbiont. These findings offer a foundation for rational drug design.


Subject(s)
Brugia malayi/genetics , Helminth Genome , Animals , Brugia malayi/physiology , Caenorhabditis/genetics , Drosophila melanogaster/genetics , Drug Resistance/genetics , Filariasis/parasitology , Humans , Molecular Sequence Data
18.
Mol Biol Evol ; 24(9): 2091-8, 2007 Sep.
Article in English | MEDLINE | ID: mdl-17642473

ABSTRACT

Overlapping genes are a common phenomenon. Among sequenced prokaryotes, more than 29% of all annotated genes overlap at least 1 of their 2 flanking genes. We present a unified model for the creation and repair of overlaps among adjacent genes where the 3' ends either overlap or nearly overlap. Our model, derived from a comprehensive analysis of complete prokaryotic genomes in GenBank, explains the nonuniform distribution of the lengths of such overlap regions far more simply than previously proposed models. Specifically, we explain the distribution of overlap lengths based on random extensions of genes to the next occurring downstream stop codon. Our model also provides an explanation for a newly observed (here) pattern in the distribution of the separation distances of closely spaced nonoverlapping genes. We provide evidence that the newly described biased distribution of separation distances is driven by the same phenomenon that creates the uneven distribution of overlap lengths. This suggests a dynamic picture of continual overlap creation and elimination.


Subject(s)
Archaeal Genes/genetics , Bacterial Genes/genetics , Overlapping Genes/genetics , Prokaryotic Cells/metabolism , Base Sequence , Molecular Evolution , Archaeal Genome , Bacterial Genome , Molecular Sequence Data
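
The abstract above explains overlap lengths through "random extensions of genes to the next occurring downstream stop codon". The sketch below illustrates only that single step: given a DNA string and the position of a stop codon that is assumed to be lost, it finds the next in-frame stop and reports how far the gene would extend. It is not the authors' model. The function names are hypothetical, the sequence is assumed to be uppercase on the coding strand, and the three standard stop codons are assumed.

```python
# Assumption: the three standard stop codons, written as DNA.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def next_in_frame_stop(seq, start):
    """Return the index of the first stop codon at or after `start`,
    reading in the frame defined by `start`, or None if there is none."""
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None


def extension_length(seq, old_stop_index):
    """Return how many bases a gene gains if the stop codon starting at
    `old_stop_index` is lost and translation runs on to the next in-frame
    stop (the new stop codon is counted). Returns None if no stop follows."""
    new_stop = next_in_frame_stop(seq, old_stop_index + 3)
    if new_stop is None:
        return None
    return new_stop - old_stop_index


# Codons: ATG AAA TAG GCC GCA TGA AA
sequence = "ATGAAATAGGCCGCATGAAA"
print(next_in_frame_stop(sequence, 0))
print(extension_length(sequence, 6))
```

If the extended gene runs past the start of its downstream neighbour, the two genes overlap by the difference, so the distribution of distances to the next stop codon shapes the distribution of overlap lengths.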
19.
BMC Bioinformatics ; 8: 64, 2007 Feb 26.
Article in English | MEDLINE | ID: mdl-17324286

ABSTRACT

BACKGROUND: Genome assemblers have grown very large and complex in response to the need for algorithms to handle the challenges of large whole-genome sequencing projects. Many of the most common uses of assemblers, however, are best served by a simpler type of assembler that requires fewer software components, uses less memory, and is far easier to install and run. RESULTS: We have developed the Minimus assembler to address these issues, and tested it on a range of assembly problems. We show that Minimus performs well on several small assembly tasks, including the assembly of viral genomes, individual genes, and BAC clones. In addition, we evaluate Minimus' performance in assembling bacterial genomes in order to assess its suitability as a component of a larger assembly pipeline. We show that, unlike other software currently used for these tasks, Minimus produces significantly fewer assembly errors, at the cost of generating a more fragmented assembly. CONCLUSION: We find that for small genomes and other small assembly tasks, Minimus is faster and far more flexible than existing tools. Due to its small size and modular design Minimus is perfectly suited to be a component of complex assembly pipelines. Minimus is released as an open-source software project and the code is available as part of the AMOS project at Sourceforge.


Subject(s)
Algorithms , Chromosome Mapping/methods , DNA/chemistry , DNA/genetics , Sequence Alignment/methods , DNA Sequence Analysis/methods , Software , Base Sequence , Molecular Sequence Data , Software Design , User-Computer Interface
20.
Bioinformatics ; 23(6): 673-9, 2007 Mar 15.
Article in English | MEDLINE | ID: mdl-17237039

ABSTRACT

MOTIVATION: The Glimmer gene-finding software has been successfully used for finding genes in bacteria, archaea and viruses representing hundreds of species. We describe several major changes to the Glimmer system, including improved methods for identifying both coding regions and start codons. We also describe a new module of Glimmer that can distinguish host and endosymbiont DNA. This module was developed in response to the discovery that eukaryotic genome sequencing projects sometimes inadvertently capture the DNA of intracellular bacteria living in the host. RESULTS: The new methods dramatically reduce the rate of false-positive predictions, while maintaining Glimmer's 99% sensitivity rate at detecting genes in most species, and they find substantially more correct start sites, as measured by comparisons to known and well-curated genes. We show that our interpolated Markov model (IMM) DNA discriminator correctly separated 99% of the sequences in a recent genome project that produced a mixture of sequences from the bacterium Prochloron didemni and its sea squirt host, Lissoclinum patella. AVAILABILITY: Glimmer is OSI Certified Open Source and available at http://cbcb.umd.edu/software/glimmer.


Subject(s)
Chromosome Mapping/methods , Bacterial DNA/genetics , Decapodiformes/genetics , Bacterial Genome/genetics , Prochloron/genetics , Software , Symbiosis/genetics , Algorithms , Animals , Artifacts , Decapodiformes/microbiology , Reproducibility of Results , Sensitivity and Specificity , DNA Sequence Analysis/methods