Results 1 - 20 of 35
1.
Nat Methods ; 19(6): 696-704, 2022 06.
Article in English | MEDLINE | ID: mdl-35361932

ABSTRACT

Variant calling has been widely used for genotyping and for improving the consensus accuracy of long-read assemblies. Variant calls are commonly hard-filtered with user-defined cutoffs. However, it is impossible to define a single set of optimal cutoffs, as the calls heavily depend on the quality of the reads, the variant caller of choice and the quality of the unpolished assembly. Here, we introduce Merfin, a k-mer based variant-filtering algorithm for improved accuracy in genotyping and genome assembly polishing. Merfin evaluates each variant based on the expected k-mer multiplicity in the reads, independently of the quality of the read alignment and variant caller's internal score. Merfin increased the precision of genotyped calls in several benchmarks, improved consensus accuracy and reduced frameshift errors when applied to human and nonhuman assemblies built from Pacific Biosciences HiFi and continuous long reads or Oxford Nanopore reads, including the first complete human genome. Moreover, we introduce assembly quality and completeness metrics that account for the expected genomic copy numbers.


Subject(s)
High-Throughput Nucleotide Sequencing , Nanopores , Genome , Genomics , Humans , Sequence Analysis, DNA
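The k-mer idea described in this abstract can be illustrated with a small sketch. This is not Merfin's actual algorithm or scoring: it only counts k-mers in the reads and keeps a proposed edit when the edited consensus contains fewer k-mers that are absent from the reads. The function names, the toy sequences, and the choice of k = 5 are all assumptions made for illustration.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer occurring in the reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def missing_kmers(seq, counts, k):
    """Number of k-mers in seq that never occur in the reads."""
    return sum(1 for i in range(len(seq) - k + 1) if counts[seq[i:i + k]] == 0)

def keep_variant(consensus, pos, ref, alt, counts, k):
    """Keep the edit only if it lowers the number of read-unsupported k-mers."""
    assert consensus[pos:pos + len(ref)] == ref
    edited = consensus[:pos] + alt + consensus[pos + len(ref):]
    lo = max(0, pos - k + 1)
    # Only k-mers overlapping the edited site can change.
    before = consensus[lo:pos + len(ref) + k - 1]
    after = edited[lo:pos + len(alt) + k - 1]
    return missing_kmers(after, counts, k) < missing_kmers(before, counts, k)

# Toy data (hypothetical): two error-free reads and a consensus with one wrong base.
truth = "ACGTACGGTCATGCAATCGGATTC"
reads = [truth[:16], truth[8:]]
consensus = truth[:10] + "G" + truth[11:]
counts = count_kmers(reads, 5)

fixes_error = keep_variant(consensus, 10, "G", "A", counts, 5)
adds_error = keep_variant(truth, 10, "A", "G", counts, 5)
```

In this toy case the correcting edit is kept and the edit that would introduce an error is rejected, without consulting any alignment or caller score.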
2.
Genome Res ; 30(9): 1291-1305, 2020 09.
Article in English | MEDLINE | ID: mdl-32801147

ABSTRACT

Complete and accurate genome assemblies form the basis of most downstream genomic analyses and are of critical importance. Recent genome assembly projects have relied on a combination of noisy long-read sequencing and accurate short-read sequencing, with the former offering greater assembly continuity and the latter providing higher consensus accuracy. The recently introduced Pacific Biosciences (PacBio) HiFi sequencing technology bridges this divide by delivering long reads (>10 kbp) with high per-base accuracy (>99.9%). Here we present HiCanu, a modification of the Canu assembler designed to leverage the full potential of HiFi reads via homopolymer compression, overlap-based error correction, and aggressive false overlap filtering. We benchmark HiCanu with a focus on the recovery of haplotype diversity, major histocompatibility complex (MHC) variants, satellite DNAs, and segmental duplications. For diploid human genomes sequenced to 30× HiFi coverage, HiCanu achieved superior accuracy and allele recovery compared to the current state of the art. On the effectively haploid CHM13 human cell line, HiCanu achieved an NG50 contig size of 77 Mbp with a per-base consensus accuracy of 99.999% (QV50), surpassing recent assemblies of high-coverage, ultralong Oxford Nanopore Technologies (ONT) reads in terms of both accuracy and continuity. This HiCanu assembly correctly resolves 337 out of 341 validation BACs sampled from known segmental duplications and provides the first preliminary assemblies of nine complete human centromeric regions. Although gaps and errors still remain within the most challenging regions of the genome, these results represent a significant advance toward the complete assembly of human genomes.


Subject(s)
Genetic Variation , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Alleles , Animals , Cell Line , Chromosome Duplication , DNA, Neoplasm , DNA, Satellite , Drosophila/genetics , Genome, Human , Haplotypes , Humans , Reproducibility of Results , Software
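Two notions from this abstract are easy to make concrete: homopolymer compression and the Phred-scaled quality value (QV) behind "99.999% (QV50)". The sketch below is a generic illustration under the usual definitions, not code from HiCanu; the function names are hypothetical.

```python
import math
from itertools import groupby

def homopolymer_compress(seq):
    """Collapse each run of identical bases to a single base."""
    return "".join(base for base, _ in groupby(seq))

def phred_qv(accuracy):
    """Phred-scaled quality: QV = -10 * log10(1 - accuracy)."""
    return -10 * math.log10(1 - accuracy)

compressed = homopolymer_compress("AAACCGTTTTA")  # runs AAA, CC, G, TTTT, A
qv_assembly = phred_qv(0.99999)  # per-base consensus accuracy quoted above
qv_reads = phred_qv(0.999)       # lower bound quoted for HiFi read accuracy
```

Under this definition, an accuracy of 99.999% corresponds to about QV50 and 99.9% to about QV30.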
3.
Nature ; 533(7602): 200-5, 2016 05 12.
Article in English | MEDLINE | ID: mdl-27088604

ABSTRACT

The whole-genome duplication 80 million years ago of the common ancestor of salmonids (salmonid-specific fourth vertebrate whole-genome duplication, Ss4R) provides unique opportunities to learn about the evolutionary fate of a duplicated vertebrate genome in 70 extant lineages. Here we present a high-quality genome assembly for Atlantic salmon (Salmo salar), and show that large genomic reorganizations, coinciding with bursts of transposon-mediated repeat expansions, were crucial for the post-Ss4R rediploidization process. Comparisons of duplicate gene expression patterns across a wide range of tissues with orthologous genes from a pre-Ss4R outgroup unexpectedly demonstrate far more instances of neofunctionalization than subfunctionalization. Surprisingly, we find that genes that were retained as duplicates after the teleost-specific whole-genome duplication 320 million years ago were not more likely to be retained after the Ss4R, and that the duplicate retention was not influenced to a great extent by the nature of the predicted protein interactions of the gene products. Finally, we demonstrate that the Atlantic salmon assembly can serve as a reference sequence for the study of other salmonids for a range of purposes.


Subject(s)
Diploidy , Evolution, Molecular , Gene Duplication/genetics , Genes, Duplicate/genetics , Genome/genetics , Salmo salar/genetics , Animals , DNA Transposable Elements/genetics , Female , Genomics , Male , Models, Genetic , Mutagenesis/genetics , Phylogeny , Reference Standards , Salmo salar/classification , Sequence Homology
4.
Bioinformatics ; 36(Suppl_1): i111-i118, 2020 07 01.
Article in English | MEDLINE | ID: mdl-32657365

ABSTRACT

MOTIVATION: In this era of exponential data growth, minimizer sampling has become a standard algorithmic technique for rapid genome sequence comparison. This technique yields a sub-linear representation of sequences, enabling their comparison in reduced space and time. A key property of the minimizer technique is that if two sequences share a substring of a specified length, then they can be guaranteed to have a matching minimizer. However, because the k-mer distribution in eukaryotic genomes is highly uneven, minimizer-based tools (e.g. Minimap2, Mashmap) opt to discard the most frequently occurring minimizers from the genome to avoid excessive false positives. By doing so, the underlying guarantee is lost and accuracy is reduced in repetitive genomic regions. RESULTS: We introduce a novel weighted-minimizer sampling algorithm. A unique feature of the proposed algorithm is that it performs minimizer sampling while considering a weight for each k-mer; i.e. the higher the weight of a k-mer, the more likely it is to be selected. By down-weighting frequently occurring k-mers, we are able to meet both objectives: (i) avoid excessive false-positive matches and (ii) maintain the minimizer match guarantee. We tested our algorithm, Winnowmap, using both simulated and real long-read data and compared it to a state-of-the-art long read mapper, Minimap2. Our results demonstrate a reduction in the mapping error-rate from 0.14% to 0.06% in the recently finished human X chromosome (154.3 Mbp), and from 3.6% to 0% within the highly repetitive X centromere (3.1 Mbp). Winnowmap improves mapping accuracy within repeats and achieves these results with sparser sampling, leading to better index compression and competitive runtimes. AVAILABILITY AND IMPLEMENTATION: Winnowmap is built on top of the Minimap2 codebase and is available at https://github.com/marbl/winnowmap.


Subject(s)
Data Compression , Software , Algorithms , Genomics , High-Throughput Nucleotide Sequencing , Humans , Sequence Analysis, DNA
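The match guarantee mentioned in this abstract can be shown with plain (unweighted) minimizer sampling. The sketch below is a textbook-style illustration and does not reproduce Winnowmap's weighted scheme or the Minimap2 implementation; the CRC32 ordering, the function name, and the toy sequences are assumptions. With windows of w consecutive k-mers, any two sequences sharing a substring of length w + k - 1 select at least one identical k-mer.

```python
import zlib

def minimizers(seq, k, w):
    """Return the set of (position, k-mer) minimizers of seq.

    For every window of w consecutive k-mers, the k-mer with the smallest
    hash is selected (ties broken by leftmost position).
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    order = [(zlib.crc32(km.encode()), i) for i, km in enumerate(kmers)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        _, i = min(order[start:start + w])
        picked.add((i, kmers[i]))
    return picked

k, w = 4, 3
shared = "ACGGCA"  # length w + k - 1 = 6
seq_a = "TTTTT" + shared + "GGGGG"
seq_b = "CCCCC" + shared + "AAAAA"

kmers_a = {km for _, km in minimizers(seq_a, k, w)}
kmers_b = {km for _, km in minimizers(seq_b, k, w)}
common = kmers_a & kmers_b
```

Discarding the most frequent minimizers after sampling, as the abstract notes some tools do, is what removes this guarantee in repetitive regions.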
5.
Genome Res ; 27(5): 722-736, 2017 05.
Article in English | MEDLINE | ID: mdl-28298431

ABSTRACT

Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. However, given the relatively high error rates of such technologies, efficient and accurate assembly of large repeats and closely related haplotypes remains challenging. We address these issues with Canu, a successor of Celera Assembler that is specifically designed for noisy single-molecule sequences. Canu introduces support for nanopore sequencing, halves depth-of-coverage requirements, and improves assembly continuity while simultaneously reducing runtime by an order of magnitude on large genomes versus Celera Assembler 8.2. These advances result from new overlapping and assembly algorithms, including an adaptive overlapping strategy based on tf-idf weighted MinHash and a sparse assembly graph construction that avoids collapsing diverged repeats and haplotypes. We demonstrate that Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either Pacific Biosciences (PacBio) or Oxford Nanopore technologies and achieves a contig NG50 of >21 Mbp on both human and Drosophila melanogaster PacBio data sets. For assembly structures that cannot be linearly represented, Canu provides graph-based assembly outputs in graphical fragment assembly (GFA) format for analysis or integration with complementary phasing and scaffolding techniques. The combination of such highly resolved assembly graphs with long-range scaffolding information promises the complete and automated assembly of complex genomes.


Subject(s)
Contig Mapping/methods , Genomics/methods , Sequence Analysis, DNA/methods , Software , Animals , Contig Mapping/standards , Drosophila melanogaster/genetics , Genome, Bacterial , Genomics/standards , Humans , Repetitive Sequences, Nucleic Acid , Sequence Analysis, DNA/standards
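The contig NG50 statistic quoted in this abstract can be computed as follows. This is a minimal sketch of the common definition (the contig length at which the cumulative length of contigs, taken from longest to shortest, first reaches half of the estimated genome size); the function name and the toy lengths are illustrative and are not taken from Canu.

```python
def ng50(contig_lengths, genome_size):
    """Length L such that contigs of length >= L cover at least half of genome_size."""
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if 2 * total >= genome_size:
            return length
    return 0  # contigs cover less than half the genome

lengths = [50, 40, 30, 20, 10]      # toy contig lengths
ng50_value = ng50(lengths, 200)     # relative to an assumed genome size of 200
n50_value = ng50(lengths, sum(lengths))  # N50 uses the assembly size instead
```

Because NG50 is normalized by genome size rather than assembly size, it allows assemblies of the same genome to be compared even when their total lengths differ.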
6.
PLoS Comput Biol ; 15(8): e1007273, 2019 08.
Article in English | MEDLINE | ID: mdl-31433799

ABSTRACT

Long-read sequencing and novel long-range assays have revolutionized de novo genome assembly by automating the reconstruction of reference-quality genomes. In particular, Hi-C sequencing is becoming an economical method for generating chromosome-scale scaffolds. Despite its increasing popularity, there are limited open-source tools available. Error rates, particularly for inversions and fusions across chromosomes, remain higher than with alternative scaffolding technologies. We present a novel open-source Hi-C scaffolder that does not require an a priori estimate of chromosome number and minimizes errors by scaffolding with the assistance of an assembly graph. We demonstrate higher accuracy than the state-of-the-art methods across a variety of Hi-C library preparations and input assembly sizes. The Python and C++ code for our method is openly available at https://github.com/machinegun/SALSA.


Subject(s)
Chromosomes, Human/genetics , Genome, Human , Genomics/methods , Algorithms , Animals , Computational Biology , Computer Simulation , Databases, Nucleic Acid/statistics & numerical data , Genomic Library , Genomics/statistics & numerical data , High-Throughput Nucleotide Sequencing/methods , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Sequence Analysis, DNA/methods , Sequence Analysis, DNA/statistics & numerical data , Software
7.
Nature ; 486(7404): 527-31, 2012 Jun 28.
Article in English | MEDLINE | ID: mdl-22722832

ABSTRACT

Two African apes are the closest living relatives of humans: the chimpanzee (Pan troglodytes) and the bonobo (Pan paniscus). Although they are similar in many respects, bonobos and chimpanzees differ strikingly in key social and sexual behaviours, and for some of these traits they show more similarity with humans than with each other. Here we report the sequencing and assembly of the bonobo genome to study its evolutionary relationship with the chimpanzee and human genomes. We find that more than three per cent of the human genome is more closely related to either the bonobo or the chimpanzee genome than these are to each other. These regions allow various aspects of the ancestry of the two ape species to be reconstructed. In addition, many of the regions that overlap genes may eventually help us understand the genetic basis of phenotypes that humans share with one of the two apes to the exclusion of the other.


Subject(s)
Evolution, Molecular , Genetic Variation/genetics , Genome, Human/genetics , Genome/genetics , Pan paniscus/genetics , Pan troglodytes/genetics , Animals , DNA Transposable Elements/genetics , Gene Duplication/genetics , Genotype , Humans , Molecular Sequence Data , Phenotype , Phylogeny , Species Specificity
8.
BMC Genomics ; 18(1): 95, 2017 01 18.
Article in English | MEDLINE | ID: mdl-28100185

ABSTRACT

BACKGROUND: The first Atlantic cod (Gadus morhua) genome assembly published in 2011 was one of the early genome assemblies exclusively based on high-throughput 454 pyrosequencing. Since then, rapid advances in sequencing technologies have led to a multitude of assemblies generated for complex genomes, although many of these are of a fragmented nature with a significant fraction of bases in gaps. The development of long-read sequencing and improved software now enable the generation of more contiguous genome assemblies. RESULTS: By combining data from Illumina, 454 and the longer PacBio sequencing technologies, as well as integrating the results of multiple assembly programs, we have created a substantially improved version of the Atlantic cod genome assembly. The sequence contiguity of this assembly is increased fifty-fold and the proportion of gap-bases has been reduced fifteen-fold. Compared to other vertebrates, the assembly contains an unusually high density of tandem repeats (TRs). Indeed, retrospective analyses reveal that gaps in the first genome assembly were largely associated with these TRs. We show that 21% of the TRs across the assembly, 19% in the promoter regions and 12% in the coding sequences are heterozygous in the sequenced individual. CONCLUSIONS: The inclusion of PacBio reads combined with the use of multiple assembly programs drastically improved the Atlantic cod genome assembly by successfully resolving long TRs. The high frequency of heterozygous TRs within or in the vicinity of genes in the genome indicates a considerable standing genomic variation in Atlantic cod populations, which is likely of evolutionary importance.


Subject(s)
Gadus morhua/genetics , Genomics/methods , Tandem Repeat Sequences/genetics , Animals , Heterozygote , Molecular Sequence Annotation , Promoter Regions, Genetic , Sequence Analysis, DNA
9.
BMC Genomics ; 18(1): 541, 2017 07 19.
Article in English | MEDLINE | ID: mdl-28724409

ABSTRACT

BACKGROUND: Long-read and short-read sequencing technologies offer competing advantages for eukaryotic genome sequencing projects. Combinations of both may be appropriate for surveys of within-species genomic variation. METHODS: We developed a hybrid assembly pipeline called "Alpaca" that can operate on 20X long-read coverage plus about 50X short-insert and 50X long-insert short-read coverage. To preclude collapse of tandem repeats, Alpaca relies on base-call-corrected long reads for contig formation. RESULTS: Compared to two other assembly protocols, Alpaca demonstrated the most reference agreement and repeat capture on the rice genome. On three accessions of the model legume Medicago truncatula, Alpaca generated the most agreement to a conspecific reference and predicted tandemly repeated genes absent from the other assemblies. CONCLUSION: Our results suggest Alpaca is a useful tool for investigating structural and copy number variation within de novo assemblies of sampled populations.


Subject(s)
Genes, Plant/genetics , Genomics/methods , DNA Copy Number Variations , Medicago truncatula/genetics , Multigene Family/genetics , Oryza/genetics , Phenotype , Tandem Repeat Sequences/genetics
10.
Nature ; 464(7288): 592-6, 2010 Mar 25.
Article in English | MEDLINE | ID: mdl-20228792

ABSTRACT

The freshwater cnidarian Hydra was first described in 1702 and has been the object of study for 300 years. Experimental studies of Hydra between 1736 and 1744 culminated in the discovery of asexual reproduction of an animal by budding, the first description of regeneration in an animal, and successful transplantation of tissue between animals. Today, Hydra is an important model for studies of axial patterning, stem cell biology and regeneration. Here we report the genome of Hydra magnipapillata and compare it to the genomes of the anthozoan Nematostella vectensis and other animals. The Hydra genome has been shaped by bursts of transposable element expansion, horizontal gene transfer, trans-splicing, and simplification of gene structure and gene content that parallel simplification of the Hydra life cycle. We also report the sequence of the genome of a novel bacterium stably associated with H. magnipapillata. Comparisons of the Hydra genome to the genomes of other animals shed light on the evolution of epithelia, contractile tissues, developmentally regulated transcription factors, the Spemann-Mangold organizer, pluripotency genes and the neuromuscular junction.


Subject(s)
Genome/genetics , Hydra/genetics , Animals , Anthozoa/genetics , Comamonadaceae/genetics , DNA Transposable Elements/genetics , Gene Transfer, Horizontal/genetics , Genome, Bacterial/genetics , Hydra/microbiology , Hydra/ultrastructure , Molecular Sequence Data , Neuromuscular Junction/ultrastructure
11.
Nucleic Acids Res ; 41(15): 7387-400, 2013 Aug.
Article in English | MEDLINE | ID: mdl-23761445

ABSTRACT

Anopheles darlingi is the principal neotropical malaria vector, responsible for more than a million cases of malaria per year on the American continent. Anopheles darlingi diverged from the African and Asian malaria vectors ∼100 million years ago (mya) and successfully adapted to the New World environment. Here we present an annotated reference A. darlingi genome, sequenced from a wild population of males and females collected in the Brazilian Amazon. A total of 10 481 predicted protein-coding genes were annotated, 72% of which have their closest counterpart in Anopheles gambiae and 21% have highest similarity with other mosquito species. In spite of a long period of divergent evolution, conserved gene synteny was observed between A. darlingi and A. gambiae. More than 10 million single nucleotide polymorphisms and short indels with potential use as genetic markers were identified. Transposable elements correspond to 2.3% of the A. darlingi genome. Genes associated with hematophagy, immunity and insecticide resistance, directly involved in vector-human and vector-parasite interactions, were identified and discussed. This study represents the first effort to sequence the genome of a neotropical malaria vector, and opens a new window through which we can contemplate the evolutionary history of anopheline mosquitoes. It also provides valuable information that may lead to novel strategies to reduce malaria transmission on the South American continent. The A. darlingi genome is accessible at www.labinfo.lncc.br/index.php/anopheles-darlingi.


Subject(s)
Anopheles/genetics , Genome, Insect , Insect Vectors/genetics , Animals , Anopheles/classification , Brazil , Chromosomes, Insect/genetics , DNA Transposable Elements , Evolution, Molecular , Female , Genetic Variation , Host-Parasite Interactions , Insect Proteins/genetics , Insect Vectors/classification , Insecticide Resistance , Insecticides/pharmacology , Malaria/parasitology , Male , Molecular Sequence Annotation , Phylogeny , Synteny , Transcriptome
12.
Proc Natl Acad Sci U S A ; 108(30): 12348-53, 2011 Jul 26.
Article in English | MEDLINE | ID: mdl-21709235

ABSTRACT

The Tasmanian devil (Sarcophilus harrisii) is threatened with extinction because of a contagious cancer known as Devil Facial Tumor Disease. The inability to mount an immune response and to reject these tumors might be caused by a lack of genetic diversity within a dwindling population. Here we report a whole-genome analysis of two animals originating from extreme northwest and southeast Tasmania, the maximal geographic spread, together with the genome from a tumor taken from one of them. A 3.3-Gb de novo assembly of the sequence data from two complementary next-generation sequencing platforms was used to identify 1 million polymorphic genomic positions, roughly one-quarter of the number observed between two genetically distant human genomes. Analysis of 14 complete mitochondrial genomes from current and museum specimens, as well as mitochondrial and nuclear SNP markers in 175 animals, suggests that the observed low genetic diversity in today's population preceded the Devil Facial Tumor Disease outbreak by at least 100 years. Using a genetically characterized breeding stock based on the genome sequence will enable preservation of the extant genetic diversity in future Tasmanian devil populations.


Subject(s)
Genetic Variation , Marsupialia/genetics , Animals , Breeding , DNA, Mitochondrial/genetics , DNA, Neoplasm/genetics , Extinction, Biological , Facial Neoplasms/genetics , Facial Neoplasms/veterinary , Genetics, Population , Genome, Mitochondrial , Humans , Models, Molecular , Molecular Sequence Data , Neoplasm Proteins/chemistry , Neoplasm Proteins/genetics , Neoplasms/genetics , Neoplasms/veterinary , Phylogeny , Polymorphism, Single Nucleotide , Tasmania , Time Factors
13.
Proc Natl Acad Sci U S A ; 107(27): 12168-73, 2010 Jul 06.
Article in English | MEDLINE | ID: mdl-20566863

ABSTRACT

As an obligatory parasite of humans, the body louse (Pediculus humanus humanus) is an important vector for human diseases, including epidemic typhus, relapsing fever, and trench fever. Here, we present genome sequences of the body louse and its primary bacterial endosymbiont Candidatus Riesia pediculicola. The body louse has the smallest known insect genome, spanning 108 Mb. Despite its status as an obligate parasite, it retains a remarkably complete basal insect repertoire of 10,773 protein-coding genes and 57 microRNAs. Representing hemimetabolous insects, the genome of the body louse thus provides a reference for studies of holometabolous insects. Compared with other insect genomes, the body louse genome contains significantly fewer genes associated with environmental sensing and response, including odorant and gustatory receptors and detoxifying enzymes. The unique architecture of the 18 minicircular mitochondrial chromosomes of the body louse may be linked to the loss of the gene encoding the mitochondrial single-stranded DNA binding protein. The genome of the obligatory louse endosymbiont Candidatus Riesia pediculicola encodes fewer than 600 genes on a short, linear chromosome and a circular plasmid. The plasmid harbors a unique arrangement of genes required for the synthesis of pantothenate, an essential vitamin deficient in the louse diet. The human body louse, its primary endosymbiont, and the bacterial pathogens that it vectors all possess genomes reduced in size compared with their free-living close relatives. Thus, the body louse genome project offers unique information and tools to use in advancing understanding of coevolution among vectors, symbionts, and pathogens.


Subject(s)
Genome, Bacterial/genetics , Genome, Insect/genetics , Pediculus/genetics , Pediculus/microbiology , Animals , Enterobacteriaceae/genetics , Genes, Bacterial/genetics , Genes, Insect/genetics , Genomics/methods , Humans , Lice Infestations/parasitology , Molecular Sequence Data , Sequence Analysis, DNA , Symbiosis
14.
Nat Biotechnol ; 41(10): 1474-1482, 2023 Oct.
Article in English | MEDLINE | ID: mdl-36797493

ABSTRACT

The Telomere-to-Telomere consortium recently assembled the first truly complete sequence of a human genome. To resolve the most complex repeats, this project relied on manual integration of ultra-long Oxford Nanopore sequencing reads with a high-resolution assembly graph built from long, accurate PacBio high-fidelity reads. We have improved and automated this strategy in Verkko, an iterative, graph-based pipeline for assembling complete, diploid genomes. Verkko begins with a multiplex de Bruijn graph built from long, accurate reads and progressively simplifies this graph by integrating ultra-long reads and haplotype-specific markers. The result is a phased, diploid assembly of both haplotypes, with many chromosomes automatically assembled from telomere to telomere. Running Verkko on the HG002 human genome resulted in 20 of 46 diploid chromosomes assembled without gaps at 99.9997% accuracy. The complete assembly of diploid genomes is a critical step towards the construction of comprehensive pangenome databases and chromosome-scale comparative genomics.


Subject(s)
Diploidy , Genomics , Humans , Sequence Analysis, DNA/methods , Genomics/methods , Genome, Human/genetics , Telomere/genetics , High-Throughput Nucleotide Sequencing/methods
15.
J Bacteriol ; 194(2): 376-94, 2012 Jan.
Article in English | MEDLINE | ID: mdl-22056929

ABSTRACT

We present the draft genome for the Rickettsia endosymbiont of Ixodes scapularis (REIS), a symbiont of the deer tick vector of Lyme disease in North America. Among Rickettsia species (Alphaproteobacteria: Rickettsiales), REIS has the largest genome sequenced to date (>2 Mb) and contains 2,309 genes across the chromosome and four plasmids (pREIS1 to pREIS4). The most remarkable finding within the REIS genome is the extraordinary proliferation of mobile genetic elements (MGEs), which contributes to a limited synteny with other Rickettsia genomes. In particular, an integrative conjugative element named RAGE (for Rickettsiales amplified genetic element), previously identified in scrub typhus rickettsiae (Orientia tsutsugamushi) genomes, is present on both the REIS chromosome and plasmids. Unlike the pseudogene-laden RAGEs of O. tsutsugamushi, REIS encodes nine conserved RAGEs that include F-like type IV secretion systems similar to that of the tra genes encoded in the Rickettsia bellii and R. massiliae genomes. An unparalleled abundance of encoded transposases (>650) relative to genome size, together with the RAGEs and other MGEs, comprise ~35% of the total genome, making REIS one of the most plastic and repetitive bacterial genomes sequenced to date. We present evidence that conserved rickettsial genes associated with an intracellular lifestyle were acquired via MGEs, especially the RAGE, through a continuum of genomic invasions. Robust phylogeny estimation suggests REIS is ancestral to the virulent spotted fever group of rickettsiae. As REIS is not known to invade vertebrate cells and has no known pathogenic effects on I. scapularis, its genome sequence provides insight on the origin of mechanisms of rickettsial pathogenicity.


Subject(s)
Gene Expression Regulation, Bacterial/physiology , Genome, Bacterial , Interspersed Repetitive Sequences , Ixodes/microbiology , Rickettsia/genetics , Animals , Arachnid Vectors/microbiology , Biological Evolution , Chromosome Mapping , Chromosomes, Bacterial , Molecular Sequence Data , Plasmids , Symbiosis
16.
Bioinformatics ; 27(13): 1869-70, 2011 Jul 01.
Article in English | MEDLINE | ID: mdl-21551146

ABSTRACT

The large number of genomes that will be sequenced will need to be annotated with genes and other functional features. Aligning gene sequences from a related species to the target genome is an economical and highly reliable method to identify genes; unfortunately, existing tools have been lacking in sensitivity and speed. A program we reported, sim4cc, was shown to be highly accurate but is limited to comparing one cDNA with one genomic sequence. We present here an optimization of the tool, implemented in the packages sim4db and leaff. The new tool performs batch alignments of cDNA and genomic sequences in a fraction of the time required by its predecessor, and thus is very well suited for genome-wide analyses. AVAILABILITY: Sim4db and leaff are written in C, C++ and Perl for Linux and other Unix platforms. Source code is distributed free of charge from http://sourceforge.net/projects/kmer/. CONTACT: florea@umiacs.umd.edu


Subject(s)
Sequence Alignment/methods , Software , Animals , Base Sequence , Biological Evolution , DNA, Complementary/genetics , Genome , Humans , RNA Splicing , Vertebrates/genetics
17.
PLoS Genet ; 4(8): e1000160, 2008 Aug 15.
Article in English | MEDLINE | ID: mdl-18704161

ABSTRACT

There is much interest in characterizing the variation in a human individual, because this may elucidate what contributes significantly to a person's phenotype, thereby enabling personalized genomics. We focus here on the variants in a person's 'exome,' which is the set of exons in a genome, because the exome is believed to harbor much of the functional variation. We provide an analysis of the approximately 12,500 variants that affect the protein coding portion of an individual's genome. We identified approximately 10,400 nonsynonymous single nucleotide polymorphisms (nsSNPs) in this individual, of which approximately 15-20% are rare in the human population. We predict approximately 1,500 nsSNPs affect protein function and these tend to be heterozygous, rare, or novel. Of the approximately 700 coding indels, approximately half tend to have lengths that are a multiple of three, which causes insertions/deletions of amino acids in the corresponding protein, rather than introducing frameshifts. Coding indels also occur frequently at the termini of genes, so even if an indel causes a frameshift, an alternative start or stop site in the gene can still be used to make a functional protein. In summary, we reduced the set of approximately 12,500 nonsilent coding variants by approximately 8-fold to a set of variants that are most likely to have major effects on their proteins' functions. This is our first glimpse of an individual's exome and a snapshot of the current state of personalized genomics. The majority of coding variants in this individual are common and appear to be functionally neutral. Our results also indicate that some variants can be used to improve the current NCBI human reference genome. As more genomes are sequenced, many rare variants and non-SNP variants will be discovered. We present an approach to analyze the coding variation in humans by proposing multiple bioinformatic methods to home in on possible functional variation.


Subject(s)
Exons , Genetic Variation , Genome, Human , Polymorphism, Single Nucleotide , Gene Frequency , Genetic Diseases, Inborn/genetics , Humans , Male , Mutation , Phenotype , Proteins/genetics , Proteins/metabolism
18.
Sci Rep ; 11(1): 2997, 2021 02 04.
Article in English | MEDLINE | ID: mdl-33542373

ABSTRACT

The rDNA clusters and flanking sequences on human chromosomes 13, 14, 15, 21 and 22 represent large gaps in the current genomic assembly. The organization and the degree of divergence of the human rDNA units within an individual nucleolar organizer region (NOR) are only partially known. To address this lacuna, we previously applied transformation-associated recombination (TAR) cloning to isolate individual rDNA units from chromosome 21. That approach revealed an unexpectedly high level of heterogeneity in human rDNA, raising the possibility of corresponding variations in ribosome dynamics. We have now applied the same strategy to analyze an entire rDNA array end-to-end from a copy of chromosome 22. Sequencing of TAR isolates provided the entire NOR sequence, including proximal and distal junctions that may be involved in nucleolar function. Comparison of the newly sequenced rDNAs to reference sequence for chromosomes 22 and 21 revealed variants that are shared in human rDNA in individuals from different ethnic groups, many of them at high frequency. Analysis infers comparable intra- and inter-individual divergence of rDNA units on the same and different chromosomes, supporting the concerted evolution of rDNA units. The results provide a route to investigate further the role of rDNA variation in nucleolar formation and in the empirical associations of nucleoli with pathology.


Subject(s)
Chromosomes, Human, Pair 22/genetics , DNA, Ribosomal/genetics , Genome, Human/genetics , Nucleolus Organizer Region/genetics , Cell Nucleolus/genetics , Cloning, Molecular , Genetic Heterogeneity , Genomics , Humans , Molecular Sequence Annotation , Ribosomes/genetics
19.
BMC Bioinformatics ; 11: 457, 2010 Sep 10.
Article in English | MEDLINE | ID: mdl-20831800

ABSTRACT

BACKGROUND: Finishing is the process of improving the quality and utility of draft genome sequences generated by shotgun sequencing and computational assembly. Finishing can involve targeted sequencing. Finishing reads may be incorporated by manual or automated means. One automated method uses targeted addition by local re-assembly of gap regions. An obvious alternative uses de novo assembly of all the reads. RESULTS: A procedure called the bounding read algorithm was developed for assembly of shotgun reads plus finishing reads and their constraints, targeting repeat regions. The algorithm was implemented within the Celera Assembler software and its pyrosequencing-specific variant, CABOG. The implementation was tested on Sanger and pyrosequencing data from six genomes. The bounding read assemblies were compared to assemblies from two other methods on the same data. The algorithm generates improved assemblies of repeat regions, closing and tiling some gaps while degrading none. CONCLUSIONS: The algorithm is useful for small-genome automated finishing projects. Our implementation is available as open-source from http://wgs-assembler.sourceforge.net under the GNU Public License.


Subject(s)
Algorithms , Sequence Analysis, DNA/methods , Databases, Factual , Genome
20.
PLoS Biol ; 5(10): e254, 2007 Sep 04.
Article in English | MEDLINE | ID: mdl-17803354

ABSTRACT

Presented here is a genome sequence of an individual human. It was produced from approximately 32 million random DNA fragments, sequenced by Sanger dideoxy technology and assembled into 4,528 scaffolds, comprising 2,810 million bases (Mb) of contiguous sequence with approximately 7.5-fold coverage for any given region. We developed a modified version of the Celera assembler to facilitate the identification and comparison of alternate alleles within this individual diploid genome. Comparison of this genome and the National Center for Biotechnology Information human reference assembly revealed more than 4.1 million DNA variants, encompassing 12.3 Mb. These variants (of which 1,288,319 were novel) included 3,213,401 single nucleotide polymorphisms (SNPs), 53,823 block substitutions (2-206 bp), 292,102 heterozygous insertion/deletion events (indels) (1-571 bp), 559,473 homozygous indels (1-82,711 bp), 90 inversions, as well as numerous segmental duplications and copy number variation regions. Non-SNP DNA variation accounts for 22% of all events identified in the donor; however, these events involve 74% of all variant bases. This suggests an important role for non-SNP genetic alterations in defining the diploid genome structure. Moreover, 44% of genes were heterozygous for one or more variants. Using a novel haplotype assembly strategy, we were able to span 1.5 Gb of genome sequence in segments >200 kb, providing further precision to the diploid nature of the genome. These data depict a definitive molecular portrait of a diploid human genome that provides a starting point for future genome comparisons and enables an era of individualized genomic information.


Subject(s)
Chromosome Mapping , Diploidy , Genome, Human , Sequence Analysis, DNA , Base Sequence , Chromosome Mapping/instrumentation , Chromosome Mapping/methods , Chromosomes, Human , Chromosomes, Human, Y/genetics , Gene Dosage , Genotype , Haplotypes , Human Genome Project , Humans , INDEL Mutation , In Situ Hybridization, Fluorescence , Male , Microarray Analysis , Middle Aged , Molecular Sequence Data , Pedigree , Phenotype , Polymorphism, Single Nucleotide , Reproducibility of Results , Sequence Analysis, DNA/instrumentation , Sequence Analysis, DNA/methods