Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 35
Filtrar
1.
Nat Methods ; 19(6): 696-704, 2022 06.
Artigo em Inglês | MEDLINE | ID: mdl-35361932

RESUMO

Variant calling has been widely used for genotyping and for improving the consensus accuracy of long-read assemblies. Variant calls are commonly hard-filtered with user-defined cutoffs. However, it is impossible to define a single set of optimal cutoffs, as the calls heavily depend on the quality of the reads, the variant caller of choice and the quality of the unpolished assembly. Here, we introduce Merfin, a k-mer based variant-filtering algorithm for improved accuracy in genotyping and genome assembly polishing. Merfin evaluates each variant based on the expected k-mer multiplicity in the reads, independently of the quality of the read alignment and variant caller's internal score. Merfin increased the precision of genotyped calls in several benchmarks, improved consensus accuracy and reduced frameshift errors when applied to human and nonhuman assemblies built from Pacific Biosciences HiFi and continuous long reads or Oxford Nanopore reads, including the first complete human genome. Moreover, we introduce assembly quality and completeness metrics that account for the expected genomic copy numbers.


Assuntos
Sequenciamento de Nucleotídeos em Larga Escala , Nanoporos , Genoma , Genômica , Humanos , Análise de Sequência de DNA
2.
Genome Res ; 30(9): 1291-1305, 2020 09.
Artigo em Inglês | MEDLINE | ID: mdl-32801147

RESUMO

Complete and accurate genome assemblies form the basis of most downstream genomic analyses and are of critical importance. Recent genome assembly projects have relied on a combination of noisy long-read sequencing and accurate short-read sequencing, with the former offering greater assembly continuity and the latter providing higher consensus accuracy. The recently introduced Pacific Biosciences (PacBio) HiFi sequencing technology bridges this divide by delivering long reads (>10 kbp) with high per-base accuracy (>99.9%). Here we present HiCanu, a modification of the Canu assembler designed to leverage the full potential of HiFi reads via homopolymer compression, overlap-based error correction, and aggressive false overlap filtering. We benchmark HiCanu with a focus on the recovery of haplotype diversity, major histocompatibility complex (MHC) variants, satellite DNAs, and segmental duplications. For diploid human genomes sequenced to 30× HiFi coverage, HiCanu achieved superior accuracy and allele recovery compared to the current state of the art. On the effectively haploid CHM13 human cell line, HiCanu achieved an NG50 contig size of 77 Mbp with a per-base consensus accuracy of 99.999% (QV50), surpassing recent assemblies of high-coverage, ultralong Oxford Nanopore Technologies (ONT) reads in terms of both accuracy and continuity. This HiCanu assembly correctly resolves 337 out of 341 validation BACs sampled from known segmental duplications and provides the first preliminary assemblies of nine complete human centromeric regions. Although gaps and errors still remain within the most challenging regions of the genome, these results represent a significant advance toward the complete assembly of human genomes.


Assuntos
Variação Genética , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Análise de Sequência de DNA/métodos , Alelos , Animais , Linhagem Celular , Duplicação Cromossômica , DNA de Neoplasias , DNA Satélite , Drosophila/genética , Genoma Humano , Haplótipos , Humanos , Reprodutibilidade dos Testes , Software
3.
Nature ; 533(7602): 200-5, 2016 05 12.
Artigo em Inglês | MEDLINE | ID: mdl-27088604

RESUMO

The whole-genome duplication 80 million years ago of the common ancestor of salmonids (salmonid-specific fourth vertebrate whole-genome duplication, Ss4R) provides unique opportunities to learn about the evolutionary fate of a duplicated vertebrate genome in 70 extant lineages. Here we present a high-quality genome assembly for Atlantic salmon (Salmo salar), and show that large genomic reorganizations, coinciding with bursts of transposon-mediated repeat expansions, were crucial for the post-Ss4R rediploidization process. Comparisons of duplicate gene expression patterns across a wide range of tissues with orthologous genes from a pre-Ss4R outgroup unexpectedly demonstrate far more instances of neofunctionalization than subfunctionalization. Surprisingly, we find that genes that were retained as duplicates after the teleost-specific whole-genome duplication 320 million years ago were not more likely to be retained after the Ss4R, and that the duplicate retention was not influenced to a great extent by the nature of the predicted protein interactions of the gene products. Finally, we demonstrate that the Atlantic salmon assembly can serve as a reference sequence for the study of other salmonids for a range of purposes.


Assuntos
Diploide , Evolução Molecular , Duplicação Gênica/genética , Genes Duplicados/genética , Genoma/genética , Salmo salar/genética , Animais , Elementos de DNA Transponíveis/genética , Feminino , Genômica , Masculino , Modelos Genéticos , Mutagênese/genética , Filogenia , Padrões de Referência , Salmo salar/classificação , Homologia de Sequência
4.
Bioinformatics ; 36(Suppl_1): i111-i118, 2020 07 01.
Artigo em Inglês | MEDLINE | ID: mdl-32657365

RESUMO

MOTIVATION: In this era of exponential data growth, minimizer sampling has become a standard algorithmic technique for rapid genome sequence comparison. This technique yields a sub-linear representation of sequences, enabling their comparison in reduced space and time. A key property of the minimizer technique is that if two sequences share a substring of a specified length, then they can be guaranteed to have a matching minimizer. However, because the k-mer distribution in eukaryotic genomes is highly uneven, minimizer-based tools (e.g. Minimap2, Mashmap) opt to discard the most frequently occurring minimizers from the genome to avoid excessive false positives. By doing so, the underlying guarantee is lost and accuracy is reduced in repetitive genomic regions. RESULTS: We introduce a novel weighted-minimizer sampling algorithm. A unique feature of the proposed algorithm is that it performs minimizer sampling while considering a weight for each k-mer; i.e. the higher the weight of a k-mer, the more likely it is to be selected. By down-weighting frequently occurring k-mers, we are able to meet both objectives: (i) avoid excessive false-positive matches and (ii) maintain the minimizer match guarantee. We tested our algorithm, Winnowmap, using both simulated and real long-read data and compared it to a state-of-the-art long read mapper, Minimap2. Our results demonstrate a reduction in the mapping error-rate from 0.14% to 0.06% in the recently finished human X chromosome (154.3 Mbp), and from 3.6% to 0% within the highly repetitive X centromere (3.1 Mbp). Winnowmap improves mapping accuracy within repeats and achieves these results with sparser sampling, leading to better index compression and competitive runtimes. AVAILABILITY AND IMPLEMENTATION: Winnowmap is built on top of the Minimap2 codebase and is available at https://github.com/marbl/winnowmap.


Assuntos
Compressão de Dados , Software , Algoritmos , Genômica , Sequenciamento de Nucleotídeos em Larga Escala , Humanos , Análise de Sequência de DNA
5.
Genome Res ; 27(5): 722-736, 2017 05.
Artigo em Inglês | MEDLINE | ID: mdl-28298431

RESUMO

Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. However, given the relatively high error rates of such technologies, efficient and accurate assembly of large repeats and closely related haplotypes remains challenging. We address these issues with Canu, a successor of Celera Assembler that is specifically designed for noisy single-molecule sequences. Canu introduces support for nanopore sequencing, halves depth-of-coverage requirements, and improves assembly continuity while simultaneously reducing runtime by an order of magnitude on large genomes versus Celera Assembler 8.2. These advances result from new overlapping and assembly algorithms, including an adaptive overlapping strategy based on tf-idf weighted MinHash and a sparse assembly graph construction that avoids collapsing diverged repeats and haplotypes. We demonstrate that Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either Pacific Biosciences (PacBio) or Oxford Nanopore technologies and achieves a contig NG50 of >21 Mbp on both human and Drosophila melanogaster PacBio data sets. For assembly structures that cannot be linearly represented, Canu provides graph-based assembly outputs in graphical fragment assembly (GFA) format for analysis or integration with complementary phasing and scaffolding techniques. The combination of such highly resolved assembly graphs with long-range scaffolding information promises the complete and automated assembly of complex genomes.


Assuntos
Mapeamento de Sequências Contíguas/métodos , Genômica/métodos , Análise de Sequência de DNA/métodos , Software , Animais , Mapeamento de Sequências Contíguas/normas , Drosophila melanogaster/genética , Genoma Bacteriano , Genômica/normas , Humanos , Sequências Repetitivas de Ácido Nucleico , Análise de Sequência de DNA/normas
6.
PLoS Comput Biol ; 15(8): e1007273, 2019 08.
Artigo em Inglês | MEDLINE | ID: mdl-31433799

RESUMO

Long-read sequencing and novel long-range assays have revolutionized de novo genome assembly by automating the reconstruction of reference-quality genomes. In particular, Hi-C sequencing is becoming an economical method for generating chromosome-scale scaffolds. Despite its increasing popularity, there are limited open-source tools available. Errors, particularly inversions and fusions across chromosomes, remain higher than alternate scaffolding technologies. We present a novel open-source Hi-C scaffolder that does not require an a priori estimate of chromosome number and minimizes errors by scaffolding with the assistance of an assembly graph. We demonstrate higher accuracy than the state-of-the-art methods across a variety of Hi-C library preparations and input assembly sizes. The Python and C++ code for our method is openly available at https://github.com/machinegun/SALSA.


Assuntos
Cromossomos Humanos/genética , Genoma Humano , Genômica/métodos , Algoritmos , Animais , Biologia Computacional , Simulação por Computador , Bases de Dados de Ácidos Nucleicos/estatística & dados numéricos , Biblioteca Genômica , Genômica/estatística & dados numéricos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Sequenciamento de Nucleotídeos em Larga Escala/estatística & dados numéricos , Humanos , Análise de Sequência de DNA/métodos , Análise de Sequência de DNA/estatística & dados numéricos , Software
7.
Nature ; 486(7404): 527-31, 2012 Jun 28.
Artigo em Inglês | MEDLINE | ID: mdl-22722832

RESUMO

Two African apes are the closest living relatives of humans: the chimpanzee (Pan troglodytes) and the bonobo (Pan paniscus). Although they are similar in many respects, bonobos and chimpanzees differ strikingly in key social and sexual behaviours, and for some of these traits they show more similarity with humans than with each other. Here we report the sequencing and assembly of the bonobo genome to study its evolutionary relationship with the chimpanzee and human genomes. We find that more than three per cent of the human genome is more closely related to either the bonobo or the chimpanzee genome than these are to each other. These regions allow various aspects of the ancestry of the two ape species to be reconstructed. In addition, many of the regions that overlap genes may eventually help us understand the genetic basis of phenotypes that humans share with one of the two apes to the exclusion of the other.


Assuntos
Evolução Molecular , Variação Genética/genética , Genoma Humano/genética , Genoma/genética , Pan paniscus/genética , Pan troglodytes/genética , Animais , Elementos de DNA Transponíveis/genética , Duplicação Gênica/genética , Genótipo , Humanos , Dados de Sequência Molecular , Fenótipo , Filogenia , Especificidade da Espécie
8.
BMC Genomics ; 18(1): 95, 2017 01 18.
Artigo em Inglês | MEDLINE | ID: mdl-28100185

RESUMO

BACKGROUND: The first Atlantic cod (Gadus morhua) genome assembly published in 2011 was one of the early genome assemblies exclusively based on high-throughput 454 pyrosequencing. Since then, rapid advances in sequencing technologies have led to a multitude of assemblies generated for complex genomes, although many of these are of a fragmented nature with a significant fraction of bases in gaps. The development of long-read sequencing and improved software now enable the generation of more contiguous genome assemblies. RESULTS: By combining data from Illumina, 454 and the longer PacBio sequencing technologies, as well as integrating the results of multiple assembly programs, we have created a substantially improved version of the Atlantic cod genome assembly. The sequence contiguity of this assembly is increased fifty-fold and the proportion of gap-bases has been reduced fifteen-fold. Compared to other vertebrates, the assembly contains an unusual high density of tandem repeats (TRs). Indeed, retrospective analyses reveal that gaps in the first genome assembly were largely associated with these TRs. We show that 21% of the TRs across the assembly, 19% in the promoter regions and 12% in the coding sequences are heterozygous in the sequenced individual. CONCLUSIONS: The inclusion of PacBio reads combined with the use of multiple assembly programs drastically improved the Atlantic cod genome assembly by successfully resolving long TRs. The high frequency of heterozygous TRs within or in the vicinity of genes in the genome indicate a considerable standing genomic variation in Atlantic cod populations, which is likely of evolutionary importance.


Assuntos
Gadus morhua/genética , Genômica/métodos , Sequências de Repetição em Tandem/genética , Animais , Heterozigoto , Anotação de Sequência Molecular , Regiões Promotoras Genéticas , Análise de Sequência de DNA
9.
BMC Genomics ; 18(1): 541, 2017 07 19.
Artigo em Inglês | MEDLINE | ID: mdl-28724409

RESUMO

BACKGROUND: Long-read and short-read sequencing technologies offer competing advantages for eukaryotic genome sequencing projects. Combinations of both may be appropriate for surveys of within-species genomic variation. METHODS: We developed a hybrid assembly pipeline called "Alpaca" that can operate on 20X long-read coverage plus about 50X short-insert and 50X long-insert short-read coverage. To preclude collapse of tandem repeats, Alpaca relies on base-call-corrected long reads for contig formation. RESULTS: Compared to two other assembly protocols, Alpaca demonstrated the most reference agreement and repeat capture on the rice genome. On three accessions of the model legume Medicago truncatula, Alpaca generated the most agreement to a conspecific reference and predicted tandemly repeated genes absent from the other assemblies. CONCLUSION: Our results suggest Alpaca is a useful tool for investigating structural and copy number variation within de novo assemblies of sampled populations.


Assuntos
Genes de Plantas/genética , Genômica/métodos , Variações do Número de Cópias de DNA , Medicago truncatula/genética , Família Multigênica/genética , Oryza/genética , Fenótipo , Sequências de Repetição em Tandem/genética
10.
Nature ; 464(7288): 592-6, 2010 Mar 25.
Artigo em Inglês | MEDLINE | ID: mdl-20228792

RESUMO

The freshwater cnidarian Hydra was first described in 1702 and has been the object of study for 300 years. Experimental studies of Hydra between 1736 and 1744 culminated in the discovery of asexual reproduction of an animal by budding, the first description of regeneration in an animal, and successful transplantation of tissue between animals. Today, Hydra is an important model for studies of axial patterning, stem cell biology and regeneration. Here we report the genome of Hydra magnipapillata and compare it to the genomes of the anthozoan Nematostella vectensis and other animals. The Hydra genome has been shaped by bursts of transposable element expansion, horizontal gene transfer, trans-splicing, and simplification of gene structure and gene content that parallel simplification of the Hydra life cycle. We also report the sequence of the genome of a novel bacterium stably associated with H. magnipapillata. Comparisons of the Hydra genome to the genomes of other animals shed light on the evolution of epithelia, contractile tissues, developmentally regulated transcription factors, the Spemann-Mangold organizer, pluripotency genes and the neuromuscular junction.


Assuntos
Genoma/genética , Hydra/genética , Animais , Antozoários/genética , Comamonadaceae/genética , Elementos de DNA Transponíveis/genética , Transferência Genética Horizontal/genética , Genoma Bacteriano/genética , Hydra/microbiologia , Hydra/ultraestrutura , Dados de Sequência Molecular , Junção Neuromuscular/ultraestrutura
11.
Nucleic Acids Res ; 41(15): 7387-400, 2013 Aug.
Artigo em Inglês | MEDLINE | ID: mdl-23761445

RESUMO

Anopheles darlingi is the principal neotropical malaria vector, responsible for more than a million cases of malaria per year on the American continent. Anopheles darlingi diverged from the African and Asian malaria vectors ∼100 million years ago (mya) and successfully adapted to the New World environment. Here we present an annotated reference A. darlingi genome, sequenced from a wild population of males and females collected in the Brazilian Amazon. A total of 10 481 predicted protein-coding genes were annotated, 72% of which have their closest counterpart in Anopheles gambiae and 21% have highest similarity with other mosquito species. In spite of a long period of divergent evolution, conserved gene synteny was observed between A. darlingi and A. gambiae. More than 10 million single nucleotide polymorphisms and short indels with potential use as genetic markers were identified. Transposable elements correspond to 2.3% of the A. darlingi genome. Genes associated with hematophagy, immunity and insecticide resistance, directly involved in vector-human and vector-parasite interactions, were identified and discussed. This study represents the first effort to sequence the genome of a neotropical malaria vector, and opens a new window through which we can contemplate the evolutionary history of anopheline mosquitoes. It also provides valuable information that may lead to novel strategies to reduce malaria transmission on the South American continent. The A. darlingi genome is accessible at www.labinfo.lncc.br/index.php/anopheles-darlingi.


Assuntos
Anopheles/genética , Genoma de Inseto , Insetos Vetores/genética , Animais , Anopheles/classificação , Brasil , Cromossomos de Insetos/genética , Elementos de DNA Transponíveis , Evolução Molecular , Feminino , Variação Genética , Interações Hospedeiro-Parasita , Proteínas de Insetos/genética , Insetos Vetores/classificação , Resistência a Inseticidas , Inseticidas/farmacologia , Malária/parasitologia , Masculino , Anotação de Sequência Molecular , Filogenia , Sintenia , Transcriptoma
12.
Proc Natl Acad Sci U S A ; 108(30): 12348-53, 2011 Jul 26.
Artigo em Inglês | MEDLINE | ID: mdl-21709235

RESUMO

The Tasmanian devil (Sarcophilus harrisii) is threatened with extinction because of a contagious cancer known as Devil Facial Tumor Disease. The inability to mount an immune response and to reject these tumors might be caused by a lack of genetic diversity within a dwindling population. Here we report a whole-genome analysis of two animals originating from extreme northwest and southeast Tasmania, the maximal geographic spread, together with the genome from a tumor taken from one of them. A 3.3-Gb de novo assembly of the sequence data from two complementary next-generation sequencing platforms was used to identify 1 million polymorphic genomic positions, roughly one-quarter of the number observed between two genetically distant human genomes. Analysis of 14 complete mitochondrial genomes from current and museum specimens, as well as mitochondrial and nuclear SNP markers in 175 animals, suggests that the observed low genetic diversity in today's population preceded the Devil Facial Tumor Disease disease outbreak by at least 100 y. Using a genetically characterized breeding stock based on the genome sequence will enable preservation of the extant genetic diversity in future Tasmanian devil populations.


Assuntos
Variação Genética , Marsupiais/genética , Animais , Cruzamento , DNA Mitocondrial/genética , DNA de Neoplasias/genética , Extinção Biológica , Neoplasias Faciais/genética , Neoplasias Faciais/veterinária , Genética Populacional , Genoma Mitocondrial , Humanos , Modelos Moleculares , Dados de Sequência Molecular , Proteínas de Neoplasias/química , Proteínas de Neoplasias/genética , Neoplasias/genética , Neoplasias/veterinária , Filogenia , Polimorfismo de Nucleotídeo Único , Tasmânia , Fatores de Tempo
13.
Proc Natl Acad Sci U S A ; 107(27): 12168-73, 2010 Jul 06.
Artigo em Inglês | MEDLINE | ID: mdl-20566863

RESUMO

As an obligatory parasite of humans, the body louse (Pediculus humanus humanus) is an important vector for human diseases, including epidemic typhus, relapsing fever, and trench fever. Here, we present genome sequences of the body louse and its primary bacterial endosymbiont Candidatus Riesia pediculicola. The body louse has the smallest known insect genome, spanning 108 Mb. Despite its status as an obligate parasite, it retains a remarkably complete basal insect repertoire of 10,773 protein-coding genes and 57 microRNAs. Representing hemimetabolous insects, the genome of the body louse thus provides a reference for studies of holometabolous insects. Compared with other insect genomes, the body louse genome contains significantly fewer genes associated with environmental sensing and response, including odorant and gustatory receptors and detoxifying enzymes. The unique architecture of the 18 minicircular mitochondrial chromosomes of the body louse may be linked to the loss of the gene encoding the mitochondrial single-stranded DNA binding protein. The genome of the obligatory louse endosymbiont Candidatus Riesia pediculicola encodes less than 600 genes on a short, linear chromosome and a circular plasmid. The plasmid harbors a unique arrangement of genes required for the synthesis of pantothenate, an essential vitamin deficient in the louse diet. The human body louse, its primary endosymbiont, and the bacterial pathogens that it vectors all possess genomes reduced in size compared with their free-living close relatives. Thus, the body louse genome project offers unique information and tools to use in advancing understanding of coevolution among vectors, symbionts, and pathogens.


Assuntos
Genoma Bacteriano/genética , Genoma de Inseto/genética , Pediculus/genética , Pediculus/microbiologia , Animais , Enterobacteriaceae/genética , Genes Bacterianos/genética , Genes de Insetos/genética , Genômica/métodos , Humanos , Infestações por Piolhos/parasitologia , Dados de Sequência Molecular , Análise de Sequência de DNA , Simbiose
14.
Nat Biotechnol ; 41(10): 1474-1482, 2023 Oct.
Artigo em Inglês | MEDLINE | ID: mdl-36797493

RESUMO

The Telomere-to-Telomere consortium recently assembled the first truly complete sequence of a human genome. To resolve the most complex repeats, this project relied on manual integration of ultra-long Oxford Nanopore sequencing reads with a high-resolution assembly graph built from long, accurate PacBio high-fidelity reads. We have improved and automated this strategy in Verkko, an iterative, graph-based pipeline for assembling complete, diploid genomes. Verkko begins with a multiplex de Bruijn graph built from long, accurate reads and progressively simplifies this graph by integrating ultra-long reads and haplotype-specific markers. The result is a phased, diploid assembly of both haplotypes, with many chromosomes automatically assembled from telomere to telomere. Running Verkko on the HG002 human genome resulted in 20 of 46 diploid chromosomes assembled without gaps at 99.9997% accuracy. The complete assembly of diploid genomes is a critical step towards the construction of comprehensive pangenome databases and chromosome-scale comparative genomics.


Assuntos
Diploide , Genômica , Humanos , Análise de Sequência de DNA/métodos , Genômica/métodos , Genoma Humano/genética , Telômero/genética , Sequenciamento de Nucleotídeos em Larga Escala/métodos
15.
J Bacteriol ; 194(2): 376-94, 2012 Jan.
Artigo em Inglês | MEDLINE | ID: mdl-22056929

RESUMO

We present the draft genome for the Rickettsia endosymbiont of Ixodes scapularis (REIS), a symbiont of the deer tick vector of Lyme disease in North America. Among Rickettsia species (Alphaproteobacteria: Rickettsiales), REIS has the largest genome sequenced to date (>2 Mb) and contains 2,309 genes across the chromosome and four plasmids (pREIS1 to pREIS4). The most remarkable finding within the REIS genome is the extraordinary proliferation of mobile genetic elements (MGEs), which contributes to a limited synteny with other Rickettsia genomes. In particular, an integrative conjugative element named RAGE (for Rickettsiales amplified genetic element), previously identified in scrub typhus rickettsiae (Orientia tsutsugamushi) genomes, is present on both the REIS chromosome and plasmids. Unlike the pseudogene-laden RAGEs of O. tsutsugamushi, REIS encodes nine conserved RAGEs that include F-like type IV secretion systems similar to that of the tra genes encoded in the Rickettsia bellii and R. massiliae genomes. An unparalleled abundance of encoded transposases (>650) relative to genome size, together with the RAGEs and other MGEs, comprise ~35% of the total genome, making REIS one of the most plastic and repetitive bacterial genomes sequenced to date. We present evidence that conserved rickettsial genes associated with an intracellular lifestyle were acquired via MGEs, especially the RAGE, through a continuum of genomic invasions. Robust phylogeny estimation suggests REIS is ancestral to the virulent spotted fever group of rickettsiae. As REIS is not known to invade vertebrate cells and has no known pathogenic effects on I. scapularis, its genome sequence provides insight on the origin of mechanisms of rickettsial pathogenicity.


Assuntos
Regulação Bacteriana da Expressão Gênica/fisiologia , Genoma Bacteriano , Sequências Repetitivas Dispersas , Ixodes/microbiologia , Rickettsia/genética , Animais , Vetores Aracnídeos/microbiologia , Evolução Biológica , Mapeamento Cromossômico , Cromossomos Bacterianos , Dados de Sequência Molecular , Plasmídeos , Simbiose
16.
Bioinformatics ; 27(13): 1869-70, 2011 Jul 01.
Artigo em Inglês | MEDLINE | ID: mdl-21551146

RESUMO

UNLABELLED: The large number of genomes that will be sequenced will need to be annotated with genes and other functional features. Aligning gene sequences from a related species to the target genome is an economical and highly reliable method to identify genes; unfortunately, existing tools have been lacking in sensitivity and speed. A program we reported, sim4cc, was shown to be highly accurate but is limited to comparing one cDNA with one genomic sequence. We present here an optimization of the tool, implemented in the packages sim4db and leaff. The new tool performs batch alignments of cDNA and genomic sequences in a fraction of the time required by its predecessor, and thus is very well suited for genome-wide analyses. AVAILABILITY: Sim4db and leaff are written in C, C++ and Perl for Linux and other Unix platforms. Source code is distributed free of charge from http://sourceforge.net/projects/kmer/. CONTACT: florea@umiacs.umd.edu


Assuntos
Alinhamento de Sequência/métodos , Software , Animais , Sequência de Bases , Evolução Biológica , DNA Complementar/genética , Genoma , Humanos , Splicing de RNA , Vertebrados/genética
17.
PLoS Genet ; 4(8): e1000160, 2008 Aug 15.
Artigo em Inglês | MEDLINE | ID: mdl-18704161

RESUMO

There is much interest in characterizing the variation in a human individual, because this may elucidate what contributes significantly to a person's phenotype, thereby enabling personalized genomics. We focus here on the variants in a person's 'exome,' which is the set of exons in a genome, because the exome is believed to harbor much of the functional variation. We provide an analysis of the approximately 12,500 variants that affect the protein coding portion of an individual's genome. We identified approximately 10,400 nonsynonymous single nucleotide polymorphisms (nsSNPs) in this individual, of which approximately 15-20% are rare in the human population. We predict approximately 1,500 nsSNPs affect protein function and these tend be heterozygous, rare, or novel. Of the approximately 700 coding indels, approximately half tend to have lengths that are a multiple of three, which causes insertions/deletions of amino acids in the corresponding protein, rather than introducing frameshifts. Coding indels also occur frequently at the termini of genes, so even if an indel causes a frameshift, an alternative start or stop site in the gene can still be used to make a functional protein. In summary, we reduced the set of approximately 12,500 nonsilent coding variants by approximately 8-fold to a set of variants that are most likely to have major effects on their proteins' functions. This is our first glimpse of an individual's exome and a snapshot of the current state of personalized genomics. The majority of coding variants in this individual are common and appear to be functionally neutral. Our results also indicate that some variants can be used to improve the current NCBI human reference genome. As more genomes are sequenced, many rare variants and non-SNP variants will be discovered. We present an approach to analyze the coding variation in humans by proposing multiple bioinformatic methods to hone in on possible functional variation.


Assuntos
Éxons , Variação Genética , Genoma Humano , Polimorfismo de Nucleotídeo Único , Frequência do Gene , Doenças Genéticas Inatas/genética , Humanos , Masculino , Mutação , Fenótipo , Proteínas/genética , Proteínas/metabolismo
18.
Sci Rep ; 11(1): 2997, 2021 02 04.
Artigo em Inglês | MEDLINE | ID: mdl-33542373

RESUMO

The rDNA clusters and flanking sequences on human chromosomes 13, 14, 15, 21 and 22 represent large gaps in the current genomic assembly. The organization and the degree of divergence of the human rDNA units within an individual nucleolar organizer region (NOR) are only partially known. To address this lacuna, we previously applied transformation-associated recombination (TAR) cloning to isolate individual rDNA units from chromosome 21. That approach revealed an unexpectedly high level of heterogeneity in human rDNA, raising the possibility of corresponding variations in ribosome dynamics. We have now applied the same strategy to analyze an entire rDNA array end-to-end from a copy of chromosome 22. Sequencing of TAR isolates provided the entire NOR sequence, including proximal and distal junctions that may be involved in nucleolar function. Comparison of the newly sequenced rDNAs to reference sequence for chromosomes 22 and 21 revealed variants that are shared in human rDNA in individuals from different ethnic groups, many of them at high frequency. Analysis infers comparable intra- and inter-individual divergence of rDNA units on the same and different chromosomes, supporting the concerted evolution of rDNA units. The results provide a route to investigate further the role of rDNA variation in nucleolar formation and in the empirical associations of nucleoli with pathology.


Assuntos
Cromossomos Humanos Par 22/genética , DNA Ribossômico/genética , Genoma Humano/genética , Região Organizadora do Nucléolo/genética , Nucléolo Celular/genética , Clonagem Molecular , Heterogeneidade Genética , Genômica , Humanos , Anotação de Sequência Molecular , Ribossomos/genética
19.
BMC Bioinformatics ; 11: 457, 2010 Sep 10.
Artigo em Inglês | MEDLINE | ID: mdl-20831800

RESUMO

BACKGROUND: Finishing is the process of improving the quality and utility of draft genome sequences generated by shotgun sequencing and computational assembly. Finishing can involve targeted sequencing. Finishing reads may be incorporated by manual or automated means. One automated method uses targeted addition by local re-assembly of gap regions. An obvious alternative uses de novo assembly of all the reads. RESULTS: A procedure called the bounding read algorithm was developed for assembly of shotgun reads plus finishing reads and their constraints, targeting repeat regions. The algorithm was implemented within the Celera Assembler software and its pyrosequencing-specific variant, CABOG. The implementation was tested on Sanger and pyrosequencing data from six genomes. The bounding read assemblies were compared to assemblies from two other methods on the same data. The algorithm generates improved assemblies of repeat regions, closing and tiling some gaps while degrading none. CONCLUSIONS: The algorithm is useful for small-genome automated finishing projects. Our implementation is available as open-source from http://wgs-assembler.sourceforge.net under the GNU Public License.


Assuntos
Algoritmos , Análise de Sequência de DNA/métodos , Bases de Dados Factuais , Genoma
20.
PLoS Biol ; 5(10): e254, 2007 Sep 04.
Artigo em Inglês | MEDLINE | ID: mdl-17803354

RESUMO

Presented here is a genome sequence of an individual human. It was produced from approximately 32 million random DNA fragments, sequenced by Sanger dideoxy technology and assembled into 4,528 scaffolds, comprising 2,810 million bases (Mb) of contiguous sequence with approximately 7.5-fold coverage for any given region. We developed a modified version of the Celera assembler to facilitate the identification and comparison of alternate alleles within this individual diploid genome. Comparison of this genome and the National Center for Biotechnology Information human reference assembly revealed more than 4.1 million DNA variants, encompassing 12.3 Mb. These variants (of which 1,288,319 were novel) included 3,213,401 single nucleotide polymorphisms (SNPs), 53,823 block substitutions (2-206 bp), 292,102 heterozygous insertion/deletion events (indels)(1-571 bp), 559,473 homozygous indels (1-82,711 bp), 90 inversions, as well as numerous segmental duplications and copy number variation regions. Non-SNP DNA variation accounts for 22% of all events identified in the donor, however they involve 74% of all variant bases. This suggests an important role for non-SNP genetic alterations in defining the diploid genome structure. Moreover, 44% of genes were heterozygous for one or more variants. Using a novel haplotype assembly strategy, we were able to span 1.5 Gb of genome sequence in segments >200 kb, providing further precision to the diploid nature of the genome. These data depict a definitive molecular portrait of a diploid human genome that provides a starting point for future genome comparisons and enables an era of individualized genomic information.


Assuntos
Mapeamento Cromossômico , Diploide , Genoma Humano , Análise de Sequência de DNA , Sequência de Bases , Mapeamento Cromossômico/instrumentação , Mapeamento Cromossômico/métodos , Cromossomos Humanos , Cromossomos Humanos Y/genética , Dosagem de Genes , Genótipo , Haplótipos , Projeto Genoma Humano , Humanos , Mutação INDEL , Hibridização in Situ Fluorescente , Masculino , Análise em Microsséries , Pessoa de Meia-Idade , Dados de Sequência Molecular , Linhagem , Fenótipo , Polimorfismo de Nucleotídeo Único , Reprodutibilidade dos Testes , Análise de Sequência de DNA/instrumentação , Análise de Sequência de DNA/métodos
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA