Results 1 - 20 of 24
1.
Nat Methods ; 15(8): 591-594, 2018 08.
Article in English | MEDLINE | ID: mdl-30013048

ABSTRACT

We describe Strelka2 ( https://github.com/Illumina/strelka ), an open-source small-variant-calling method for research and clinical germline and somatic sequencing applications. Strelka2 introduces a novel mixture-model-based estimation of insertion/deletion error parameters from each sample, an efficient tiered haplotype-modeling strategy, and a normal sample contamination model to improve liquid tumor analysis. For both germline and somatic calling, Strelka2 substantially outperformed the current leading tools in terms of both variant-calling accuracy and computing cost.


Subject(s)
Genetic Variation , Germ-Line Mutation , Software , Databases, Genetic/statistics & numerical data , Haplotypes , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , INDEL Mutation , Models, Genetic , Neoplasms/genetics , Whole Genome Sequencing/statistics & numerical data
2.
Genome Res ; 27(1): 157-164, 2017 01.
Article in English | MEDLINE | ID: mdl-27903644

ABSTRACT

Improvement of variant calling in next-generation sequence data requires a comprehensive, genome-wide catalog of high-confidence variants called in a set of genomes for use as a benchmark. We generated deep, whole-genome sequence data of 17 individuals in a three-generation pedigree and called variants in each genome using a range of currently available algorithms. We used haplotype transmission information to create a phased "Platinum" variant catalog of 4.7 million single-nucleotide variants (SNVs) plus 0.7 million small (1-50 bp) insertions and deletions (indels) that are consistent with the pattern of inheritance in the parents and 11 children of this pedigree. Platinum genotypes are highly concordant with the current catalog of the National Institute of Standards and Technology for both SNVs (>99.99%) and indels (99.92%) and add a validated truth catalog that has 26% more SNVs and 45% more indels. Analysis of 334,652 SNVs that were consistent between informatics pipelines yet inconsistent with haplotype transmission ("nonplatinum") revealed that the majority of these variants are de novo and cell-line mutations or reside within previously unidentified duplications and deletions. The reference materials from this study are a resource for objective assessment of the accuracy of variant calls throughout genomes.


Subject(s)
Genome, Human/genetics , Genomics , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Algorithms , Databases, Genetic , Exome/genetics , Genotype , Humans , INDEL Mutation/genetics , Pedigree , Polymorphism, Single Nucleotide , Software
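
Note: the inheritance filter described in this abstract can be illustrated at a much simpler level than the haplotype-transmission analysis the authors report. The sketch below only checks genotype-level Mendelian consistency for one site across the children of two parents; the function names and the unphased two-allele tuples are assumptions made for illustration, not the published method.

```python
from itertools import product


def is_mendelian_consistent(child, mother, father):
    """Return True if the child's unphased genotype can be formed by
    taking one allele from the mother and one from the father.

    Each genotype is a tuple of two alleles, e.g. ("A", "G").
    """
    possible = {tuple(sorted(pair)) for pair in product(mother, father)}
    return tuple(sorted(child)) in possible


def consistent_in_pedigree(children, mother, father):
    """Keep a site only if every child's genotype is consistent with the parents."""
    return all(is_mendelian_consistent(c, mother, father) for c in children)


# Hypothetical genotypes at a single site
mother = ("A", "A")
father = ("A", "G")
children = [("A", "A"), ("A", "G"), ("G", "A")]
print(consistent_in_pedigree(children, mother, father))
```

A site where any child carries a genotype such as ("G", "G") would fail this check with these parents; the abstract attributes most such failures to de novo or cell-line mutations and to unrecognized duplications and deletions.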
3.
Genet Med ; 22(5): 945-953, 2020 05.
Article in English | MEDLINE | ID: mdl-32066871

ABSTRACT

PURPOSE: Spinal muscular atrophy (SMA), caused by loss of the SMN1 gene, is a leading cause of early childhood death. Due to the near identical sequences of SMN1 and SMN2, analysis of this region is challenging. Population-wide SMA screening to quantify the SMN1 copy number (CN) is recommended by the American College of Medical Genetics and Genomics. METHODS: We developed a method that accurately identifies the CN of SMN1 and SMN2 using genome sequencing (GS) data by analyzing read depth and eight informative reference genome differences between SMN1/2. RESULTS: We characterized SMN1/2 in 12,747 genomes, identified 1568 samples with SMN1 gains or losses and 6615 samples with SMN2 gains or losses, and calculated a pan-ethnic carrier frequency of 2%, consistent with previous studies. Additionally, 99.8% of our SMN1 and 99.7% of SMN2 CN calls agreed with orthogonal methods, with a recall of 100% for SMA and 97.8% for carriers, and a precision of 100% for both SMA and carriers. CONCLUSION: This SMN copy-number caller can be used to identify both carrier and affected status of SMA, enabling SMA testing to be offered as a comprehensive test in neonatal care and an accurate carrier screening tool in GS sequencing projects.


Subject(s)
Muscular Atrophy, Spinal , Base Sequence , Child , Child, Preschool , Humans , Muscular Atrophy, Spinal/diagnosis , Muscular Atrophy, Spinal/genetics , Survival of Motor Neuron 1 Protein/genetics
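
Note: the abstract says the caller combines read depth with bases that differ between SMN1 and SMN2. A minimal sketch of that general idea follows, assuming a diploid genome where median depth corresponds to two copies and assuming per-site read counts are already available. The function names, the use of a median across sites, and the rounding step are illustrative assumptions and not the published algorithm.

```python
from statistics import median


def total_copy_number(region_depth, genome_median_depth):
    """Estimate combined SMN1+SMN2 copies from depth (median depth = 2 copies)."""
    return 2.0 * region_depth / genome_median_depth


def split_copy_number(total_cn, site_read_counts):
    """Split the combined copy number into (SMN1, SMN2).

    site_read_counts holds (smn1_reads, smn2_reads) for each differentiating site;
    the median SMN1 read fraction across sites sets the SMN1 share.
    """
    fractions = [a / (a + b) for a, b in site_read_counts if a + b > 0]
    smn1_fraction = median(fractions)
    total = round(total_cn)
    smn1 = round(total * smn1_fraction)
    return smn1, total - smn1


# Hypothetical sample: 60x over the SMN region against a 30x genome median
total = total_copy_number(60, 30)
smn1, smn2 = split_copy_number(total, [(15, 45), (14, 46), (16, 44)])
print(total, smn1, smn2)  # one SMN1 copy would flag a carrier under this sketch
```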
4.
Nature ; 487(7406): 190-5, 2012 Jul 11.
Article in English | MEDLINE | ID: mdl-22785314

ABSTRACT

Recent advances in whole-genome sequencing have brought the vision of personal genomics and genomic medicine closer to reality. However, current methods lack clinical accuracy and the ability to describe the context (haplotypes) in which genome variants co-occur in a cost-effective manner. Here we describe a low-cost DNA sequencing and haplotyping process, long fragment read (LFR) technology, which is similar to sequencing long single DNA molecules without cloning or separation of metaphase chromosomes. In this study, ten LFR libraries were made using only ~100 picograms of human DNA per sample. Up to 97% of the heterozygous single nucleotide variants were assembled into long haplotype contigs. Removal of false positive single nucleotide variants not phased by multiple LFR haplotypes resulted in a final genome error rate of 1 in 10 megabases. Cost-effective and accurate genome sequencing and haplotyping from 10-20 human cells, as demonstrated here, will enable comprehensive genetic studies and diverse clinical applications.


Subject(s)
Genome, Human , Genomics/methods , Sequence Analysis, DNA/methods , Alleles , Cell Line , Female , Gene Silencing , Genetic Variation , Haplotypes , Humans , Mutation , Reproducibility of Results , Sequence Analysis, DNA/economics , Sequence Analysis, DNA/standards
5.
Proc Natl Acad Sci U S A ; 107(37): 16184-9, 2010 Sep 14.
Article in English | MEDLINE | ID: mdl-20733077

ABSTRACT

Prochlorococcus describes a diverse and abundant genus of marine photosynthetic microbes. It is primarily found in oligotrophic waters across the globe and plays a crucial role in energy and nutrient cycling in the ocean ecosystem. The abundance, global distribution, and availability of isolates make Prochlorococcus a model system for understanding marine microbial diversity and biogeochemical cycling. Analysis of 73 metagenomic samples from the Global Ocean Sampling expedition acquired in the Atlantic, Pacific, and Indian Oceans revealed the presence of two uncharacterized Prochlorococcus clades. A phylogenetic analysis using six different genetic markers places the clades close to known lineages adapted to high-light environments. The two uncharacterized clades consistently cooccur and dominate the surface waters of high-temperature, macronutrient-replete, and low-iron regions of the Eastern Equatorial Pacific upwelling and the tropical Indian Ocean. They are genetically distinct from each other and other high-light Prochlorococcus isolates and likely define a previously unrecognized ecotype. Our detailed genomic analysis indicates that these clades comprise organisms that are adapted to iron-depleted environments by reducing their iron quota through the loss of several iron-containing proteins that likely function as electron sinks in the photosynthetic pathway in other Prochlorococcus clades from high-light environments. The presence and inferred physiology of these clades may explain why Prochlorococcus populations from iron-depleted regions do not respond to iron fertilization experiments and further expand our understanding of how phytoplankton adapt to variations in nutrient availability in the ocean.


Subject(s)
Iron/metabolism , Prochlorococcus/isolation & purification , Biodiversity , Genome, Bacterial , Oceans and Seas , Phylogeny , Prochlorococcus/genetics , Prochlorococcus/metabolism
6.
J Mol Diagn ; 24(4): 337-350, 2022 04.
Article in English | MEDLINE | ID: mdl-35134542

ABSTRACT

Pharmacogenetic tests typically target selected sequence variants to identify haplotypes that are often defined by star (∗) allele nomenclature. Due to their design, these targeted genotyping assays are unable to detect novel variants that may change the function of the gene product and thereby affect phenotype prediction and patient care. In the current study, 137 DNA samples that were previously characterized by the Genetic Testing Reference Material (GeT-RM) program using a variety of targeted genotyping methods were recharacterized using targeted and whole genome sequencing analysis. Sequence data were analyzed using three genotype calling tools to identify star allele diplotypes for CYP2C8, CYP2C9, and CYP2C19. The genotype calls from next-generation sequencing (NGS) correlated well to those previously reported, except when novel alleles were present in a sample. Six novel alleles and 38 novel suballeles were identified in the three genes due to identification of variants not covered by targeted genotyping assays. In addition, several ambiguous genotype calls from a previous study were resolved using the NGS and/or long-read NGS data. Diplotype calls were mostly consistent between the calling algorithms, although several discrepancies were noted. This study highlights the utility of NGS for pharmacogenetic testing and demonstrates that there are many novel alleles that are yet to be discovered, even in highly characterized genes such as CYP2C9 and CYP2C19.


Subject(s)
Cytochrome P-450 CYP2C19 , Cytochrome P-450 CYP2C8 , Cytochrome P-450 CYP2C9 , Genetic Testing , High-Throughput Nucleotide Sequencing , Alleles , Cytochrome P-450 CYP2C19/genetics , Cytochrome P-450 CYP2C8/genetics , Cytochrome P-450 CYP2C9/genetics , Genotype , Haplotypes/genetics , Humans
7.
PLoS Biol ; 5(4): e101, 2007 Apr.
Article in English | MEDLINE | ID: mdl-17407382

ABSTRACT

Owing to their phylogenetic position, cartilaginous fishes (sharks, rays, skates, and chimaeras) provide a critical reference for our understanding of vertebrate genome evolution. The relatively small genome of the elephant shark, Callorhinchus milii, a chimaera, makes it an attractive model cartilaginous fish genome for whole-genome sequencing and comparative analysis. Here, the authors describe survey sequencing (1.4x coverage) and comparative analysis of the elephant shark genome, one of the first cartilaginous fish genomes to be sequenced to this depth. Repetitive sequences, represented mainly by a novel family of short interspersed element-like and long interspersed element-like sequences, account for about 28% of the elephant shark genome. Fragments of approximately 15,000 elephant shark genes reveal specific examples of genes that have been lost differentially during the evolution of tetrapod and teleost fish lineages. Interestingly, the degree of conserved synteny and conserved sequences between the human and elephant shark genomes is higher than that between human and teleost fish genomes. The elephant shark genome contains four putative Hox clusters, indicating that, unlike teleost fish genomes, it has not experienced an additional whole-genome duplication. These findings underscore the importance of the elephant shark as a critical reference vertebrate genome for comparative analysis of the human and other vertebrate genomes. This study also demonstrates that a survey-sequencing approach can be applied productively for comparative analysis of distantly related vertebrate genomes.


Subject(s)
Genome , Sharks/genetics , Amino Acid Sequence , Animals , Base Sequence , DNA , Humans , Molecular Sequence Data , Phylogeny , Repetitive Sequences, Nucleic Acid
8.
PLoS Biol ; 5(3): e16, 2007 Mar.
Article in English | MEDLINE | ID: mdl-17355171

ABSTRACT

Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling (GOS) sequences. The GOS dataset covers nearly all known prokaryotic protein families. A total of 3,995 medium- and large-sized clusters consisting of only GOS sequences are identified, out of which 1,700 have no detectable homology to known families. The GOS-only clusters contain a higher than expected proportion of sequences of viral origin, thus reflecting a poor sampling of viral diversity until now. Protein domain distributions in the GOS dataset and current protein databases show distinct biases. Several protein domains that were previously categorized as kingdom specific are shown to have GOS examples in other kingdoms. About 6,000 sequences (ORFans) from the literature that heretofore lacked similarity to known proteins have matches in the GOS data. The GOS dataset is also used to improve remote homology detection. Overall, besides nearly doubling the number of current proteins, the predicted GOS proteins also add a great deal of diversity to known protein families and shed light on their evolution. These observations are illustrated using several protein families, including phosphatases, proteases, ultraviolet-irradiation DNA damage repair enzymes, glutamine synthetase, and RuBisCO. The diversity added by GOS data has implications for choosing targets for experimental structure characterization as part of structural genomics efforts. Our analysis indicates that new families are being discovered at a rate that is linear or almost linear with the addition of new sequences, implying that we are still far from discovering all protein families in nature.


Subject(s)
Proteins/chemistry , Expressed Sequence Tags , Oceans and Seas , Proteins/genetics , Water Microbiology
9.
PLoS Biol ; 5(10): e254, 2007 Sep 04.
Article in English | MEDLINE | ID: mdl-17803354

ABSTRACT

Presented here is a genome sequence of an individual human. It was produced from approximately 32 million random DNA fragments, sequenced by Sanger dideoxy technology and assembled into 4,528 scaffolds, comprising 2,810 million bases (Mb) of contiguous sequence with approximately 7.5-fold coverage for any given region. We developed a modified version of the Celera assembler to facilitate the identification and comparison of alternate alleles within this individual diploid genome. Comparison of this genome and the National Center for Biotechnology Information human reference assembly revealed more than 4.1 million DNA variants, encompassing 12.3 Mb. These variants (of which 1,288,319 were novel) included 3,213,401 single nucleotide polymorphisms (SNPs), 53,823 block substitutions (2-206 bp), 292,102 heterozygous insertion/deletion events (indels)(1-571 bp), 559,473 homozygous indels (1-82,711 bp), 90 inversions, as well as numerous segmental duplications and copy number variation regions. Non-SNP DNA variation accounts for 22% of all events identified in the donor, however they involve 74% of all variant bases. This suggests an important role for non-SNP genetic alterations in defining the diploid genome structure. Moreover, 44% of genes were heterozygous for one or more variants. Using a novel haplotype assembly strategy, we were able to span 1.5 Gb of genome sequence in segments >200 kb, providing further precision to the diploid nature of the genome. These data depict a definitive molecular portrait of a diploid human genome that provides a starting point for future genome comparisons and enables an era of individualized genomic information.


Subject(s)
Chromosome Mapping , Diploidy , Genome, Human , Sequence Analysis, DNA , Base Sequence , Chromosome Mapping/instrumentation , Chromosome Mapping/methods , Chromosomes, Human , Chromosomes, Human, Y/genetics , Gene Dosage , Genotype , Haplotypes , Human Genome Project , Humans , INDEL Mutation , In Situ Hybridization, Fluorescence , Male , Microarray Analysis , Middle Aged , Molecular Sequence Data , Pedigree , Phenotype , Polymorphism, Single Nucleotide , Reproducibility of Results , Sequence Analysis, DNA/instrumentation , Sequence Analysis, DNA/methods
10.
PLoS Biol ; 5(3): e77, 2007 Mar.
Article in English | MEDLINE | ID: mdl-17355176

ABSTRACT

The world's oceans contain a complex mixture of micro-organisms that are for the most part, uncharacterized both genetically and biochemically. We report here a metagenomic study of the marine planktonic microbiota in which surface (mostly marine) water samples were analyzed as part of the Sorcerer II Global Ocean Sampling expedition. These samples, collected across a several-thousand km transect from the North Atlantic through the Panama Canal and ending in the South Pacific yielded an extensive dataset consisting of 7.7 million sequencing reads (6.3 billion bp). Though a few major microbial clades dominate the planktonic marine niche, the dataset contains great diversity with 85% of the assembled sequence and 57% of the unassembled data being unique at a 98% sequence identity cutoff. Using the metadata associated with each sample and sequencing library, we developed new comparative genomic and assembly methods. One comparative genomic method, termed "fragment recruitment," addressed questions of genome structure, evolution, and taxonomic or phylogenetic diversity, as well as the biochemical diversity of genes and gene families. A second method, termed "extreme assembly," made possible the assembly and reconstruction of large segments of abundant but clearly nonclonal organisms. Within all abundant populations analyzed, we found extensive intra-ribotype diversity in several forms: (1) extensive sequence variation within orthologous regions throughout a given genome; despite coverage of individual ribotypes approaching 500-fold, most individual sequencing reads are unique; (2) numerous changes in gene content some with direct adaptive implications; and (3) hypervariable genomic islands that are too variable to assemble. The intra-ribotype diversity is organized into genetically isolated populations that have overlapping but independent distributions, implying distinct environmental preference. We present novel methods for measuring the genomic similarity between metagenomic samples and show how they may be grouped into several community types. Specific functional adaptations can be identified both within individual ribotypes and across the entire community, including proteorhodopsin spectral tuning and the presence or absence of the phosphate-binding gene PstS.


Subject(s)
Water Microbiology , Computational Biology , Food Chain , Oceans and Seas , Plankton , Species Specificity
11.
Nature ; 431(7011): 927-30, 2004 Oct 21.
Article in English | MEDLINE | ID: mdl-15496912

ABSTRACT

Complex eukaryotic genomes are now being sequenced at an accelerated pace primarily using whole-genome shotgun (WGS) sequence assembly approaches. WGS assembly was initially criticized because of its perceived inability to resolve repeat structures within genomes. Here, we quantify the effect of WGS sequence assembly on large, highly similar repeats by comparison of the segmental duplication content of two different human genome assemblies. Our analysis shows that large (> 15 kilobases) and highly identical (> 97%) duplications are not adequately resolved by WGS assembly. This leads to significant reduction in genome length and the loss of genes embedded within duplications. Comparable analyses of mouse genome assemblies confirm that strict WGS sequence assembly will oversimplify our understanding of mammalian genome structure and evolution; a hybrid strategy using a targeted clone-by-clone approach to resolve duplications is proposed.


Subject(s)
Gene Duplication , Genome, Human , Genomics/methods , Physical Chromosome Mapping/methods , Sequence Analysis, DNA/methods , Animals , Chromosomes, Human/genetics , Computational Biology/methods , Genes, Duplicate/genetics , Humans , Mice , Sensitivity and Specificity , Sequence Alignment
12.
PLoS Genet ; 3(9): 1702-8, 2007 Sep.
Article in English | MEDLINE | ID: mdl-17892324

ABSTRACT

Since only a small fraction of environmental bacteria are amenable to laboratory culture, there is great interest in genomic sequencing directly from single cells. Sufficient DNA for sequencing can be obtained from one cell by the Multiple Displacement Amplification (MDA) method, thereby eliminating the need to develop culture methods. Here we used a microfluidic device to isolate individual Escherichia coli and amplify genomic DNA by MDA in 60-nl reactions. Our results confirm a report that reduced MDA reaction volume lowers nonspecific synthesis that can result from contaminant DNA templates and unfavourable interaction between primers. The quality of the genome amplification was assessed by qPCR and compared favourably to single-cell amplifications performed in standard 50-µl volumes. Amplification bias was greatly reduced in nanoliter volumes, thereby providing a more even representation of all sequences. Single-cell amplicons from both microliter and nanoliter volumes provided high-quality sequence data by high-throughput pyrosequencing, thereby demonstrating a straightforward route to sequencing genomes from single cells.


Subject(s)
Gene Amplification , Genome , Nanotechnology , In Situ Hybridization, Fluorescence , Microfluidics , RNA Probes
13.
Bioinformatics ; 24(8): 1035-40, 2008 Apr 15.
Article in English | MEDLINE | ID: mdl-18321888

ABSTRACT

MOTIVATION: We present an algorithm to identify allelic variation given a Whole Genome Shotgun (WGS) assembly of haploid sequences, and to produce a set of haploid consensus sequences rather than a single consensus sequence. Existing WGS assemblers take a column-by-column approach to consensus generation, and produce a single consensus sequence which can be inconsistent with the underlying haploid alleles, and inconsistent with any of the aligned sequence reads. Our new algorithm uses a dynamic windowing approach. It detects alleles by simultaneously processing the portions of aligned reads spanning a region of sequence variation, assigns reads to their respective alleles, phases adjacent variant alleles and generates a consensus sequence corresponding to each confirmed allele. This algorithm was used to produce the first diploid genome sequence of an individual human. It can also be applied to assemblies of multiple diploid individuals and hybrid assemblies of multiple haploid organisms. RESULTS: Applied to the individual human genome assembly, the new algorithm detects exactly two confirmed alleles and reports two consensus sequences in 98.98% of the 2,033,311 detected regions of sequence variation. In 33,269 out of 460,373 detected regions of size >1 bp, it fixes the constructed errors of a mosaic haploid representation of a diploid locus as produced by the original Celera Assembler consensus algorithm. Using an optimized procedure calibrated against 1,506,344 known SNPs, it detects 438,814 new heterozygous SNPs with a false positive rate of 12%. AVAILABILITY: The open source code is available at: http://wgs-assembler.cvs.sourceforge.net/wgs-assembler/


Subject(s)
Algorithms , Chromosome Mapping/methods , Consensus Sequence/genetics , DNA Mutational Analysis/methods , Genetic Variation/genetics , Genome, Human/genetics , Haploidy , Software , Base Sequence , Gene Frequency/genetics , Humans , Molecular Sequence Data , Sequence Analysis, DNA/methods
14.
Environ Microbiol ; 10(9): 2200-10, 2008 Sep.
Article in English | MEDLINE | ID: mdl-18637951

ABSTRACT

The study of microbial diversity patterns is hampered by the enormous diversity of microbial communities and the lack of resources to sample them exhaustively. For many questions about richness and evenness, however, one only needs to know the relative order of diversity among samples rather than total diversity. We used 16S libraries from the Global Ocean Survey to investigate the ability of 10 diversity statistics (including rarefaction, non-parametric, parametric, curve extrapolation and diversity indices) to assess the relative diversity of six aquatic bacterial communities. Overall, we found that the statistics yielded remarkably similar rankings of the samples for a given sequence similarity cut-off. This correspondence, despite the different underlying assumptions of the statistics, suggests that diversity statistics are a useful tool for ranking samples of microbial diversity. In addition, sequence similarity cut-off influenced the diversity ranking of the samples, demonstrating that diversity statistics can also be used to detect differences in phylogenetic structure among microbial communities. Finally, a subsampling analysis suggests that further sequencing from these particular clone libraries would not have substantially changed the richness rankings of the samples.


Subject(s)
Bacteria/genetics , Biodiversity , RNA, Ribosomal, 16S/genetics , Statistics as Topic/methods , Water Microbiology , Bacteria/classification , DNA, Bacterial/genetics , Environmental Monitoring , Gene Library , Genes, rRNA , Sample Size , Sequence Alignment , Sequence Analysis, DNA
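
Note: the abstract does not name its 10 statistics, so the sketch below picks two common ones as stand-ins, the Shannon index (a diversity index) and the bias-corrected Chao1 estimator (a non-parametric richness estimator), to show how samples can be ranked by relative diversity. The OTU count tables and sample names are hypothetical.

```python
import math


def shannon_index(otu_counts):
    """Shannon diversity H' = -sum(p_i * ln p_i) over OTUs with nonzero counts."""
    total = sum(otu_counts)
    return -sum((c / total) * math.log(c / total) for c in otu_counts if c > 0)


def chao1(otu_counts):
    """Bias-corrected Chao1 richness: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)),
    where F1 and F2 are the numbers of singleton and doubleton OTUs."""
    observed = sum(1 for c in otu_counts if c > 0)
    singletons = sum(1 for c in otu_counts if c == 1)
    doubletons = sum(1 for c in otu_counts if c == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))


def rank_samples(samples, statistic):
    """Return sample names ordered from most to least diverse under `statistic`."""
    return sorted(samples, key=lambda name: statistic(samples[name]), reverse=True)


# Hypothetical OTU count tables for three clone libraries
samples = {
    "site_A": [10, 10, 10, 10, 1, 1, 1, 2],
    "site_B": [30, 5, 2, 1, 1],
    "site_C": [40, 1],
}
print(rank_samples(samples, shannon_index))
print(rank_samples(samples, chao1))
```

When two statistics with different assumptions return the same ordering, as they do for these made-up counts, the ranking is less likely to be an artifact of any one estimator, which is the point the abstract makes.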
15.
J Comput Biol ; 19(3): 279-92, 2012 Mar.
Article in English | MEDLINE | ID: mdl-22175250

ABSTRACT

Unchained base reads on self-assembling DNA nanoarrays have recently emerged as a promising approach to low-cost, high-quality resequencing of human genomes. Because of unique characteristics of these mated pair reads, existing computational methods for resequencing assembly, such as those based on map-consensus calling, are not adequate for accurate variant calling. We describe novel computational methods developed for accurate calling of SNPs and short substitutions and indels (<100 bp); the same methods apply to evaluation of hypothesized larger, structural variations. We use an optimization process that iteratively adjusts the genome sequence to maximize its a posteriori probability given the observed reads. For each candidate sequence, this probability is computed using Bayesian statistics with a simple read generation model and simplifying assumptions that make the problem computationally tractable. The optimization process iteratively applies one-base substitutions, insertions, and deletions until convergence is achieved to an optimum diploid sequence. A local de novo assembly procedure that generalizes approaches based on De Bruijn graphs is used to seed the optimization process in order to reduce the chance of converging to local optima. Finally, a correlation-based filter is applied to reduce the false positive rate caused by the presence of repetitive regions in the reference genome.


Subject(s)
Contig Mapping/methods , Genome, Human , Sequence Analysis, DNA/methods , Algorithms , Alleles , Base Sequence , Bayes Theorem , Chromosome Mapping , Computer Simulation , Data Interpretation, Statistical , Humans , Models, Genetic
16.
ISME J ; 6(6): 1186-99, 2012 Jun.
Article in English | MEDLINE | ID: mdl-22170421

ABSTRACT

Bacteria in the 16S rRNA clade SAR86 are among the most abundant uncultivated constituents of microbial assemblages in the surface ocean for which little genomic information is currently available. Bioinformatic techniques were used to assemble two nearly complete genomes from marine metagenomes and single-cell sequencing provided two more partial genomes. Recruitment of metagenomic data shows that these SAR86 genomes substantially increase our knowledge of non-photosynthetic bacteria in the surface ocean. Phylogenomic analyses establish SAR86 as a basal and divergent lineage of γ-proteobacteria, and the individual genomes display a temperature-dependent distribution. Modestly sized at 1.25-1.7 Mbp, the SAR86 genomes lack several pathways for amino-acid and vitamin synthesis as well as sulfate reduction, trends commonly observed in other abundant marine microbes. SAR86 appears to be an aerobic chemoheterotroph with the potential for proteorhodopsin-based ATP generation, though the apparent lack of a retinal biosynthesis pathway may require it to scavenge exogenously-derived pigments to utilize proteorhodopsin. The genomes contain an expanded capacity for the degradation of lipids and carbohydrates acquired using a wealth of tonB-dependent outer membrane receptors. Like the abundant planktonic marine bacterial clade SAR11, SAR86 exhibits metabolic streamlining, but also a distinct carbon compound specialization, possibly avoiding competition.


Subject(s)
Gammaproteobacteria/classification , Metagenomics , Phylogeny , Seawater/microbiology , Computational Biology , Gammaproteobacteria/genetics , Gammaproteobacteria/metabolism , Genome, Bacterial , Genomic Library , Oceans and Seas , Plankton/genetics , RNA, Ribosomal, 16S/genetics , Rhodopsin , Rhodopsins, Microbial
17.
Science ; 327(5961): 78-81, 2010 Jan 01.
Article in English | MEDLINE | ID: mdl-19892942

ABSTRACT

Genome sequencing of large numbers of individuals promises to advance the understanding, treatment, and prevention of human diseases, among other applications. We describe a genome sequencing platform that achieves efficient imaging and low reagent consumption with combinatorial probe anchor ligation chemistry to independently assay each base from patterned nanoarrays of self-assembling DNA nanoballs. We sequenced three human genomes with this platform, generating an average of 45- to 87-fold coverage per genome and identifying 3.2 to 4.5 million sequence variants per genome. Validation of one genome data set demonstrates a sequence accuracy of about 1 false variant per 100 kilobases. The high accuracy, affordable cost of $4400 for sequencing consumables, and scalability of this platform enable complete human genome sequencing for the detection of rare variants in large-scale genetic studies.


Subject(s)
DNA/chemistry , Genome, Human , Microarray Analysis , Sequence Analysis, DNA/methods , Base Sequence , Computational Biology , Costs and Cost Analysis , DNA/genetics , Databases, Nucleic Acid , Genomic Library , Genotype , Haplotypes , Human Genome Project , Humans , Male , Nanostructures , Nanotechnology , Nucleic Acid Amplification Techniques , Polymorphism, Single Nucleotide , Sequence Analysis, DNA/economics , Sequence Analysis, DNA/instrumentation , Sequence Analysis, DNA/standards , Software
18.
Genome Res ; 18(8): 1336-46, 2008 Aug.
Article in English | MEDLINE | ID: mdl-18676820

ABSTRACT

In comparison to genotypes, knowledge about haplotypes (the combination of alleles present on a single chromosome) is much more useful for whole-genome association studies and for making inferences about human evolutionary history. Haplotypes are typically inferred from population genotype data using computational methods. Whole-genome sequence data represent a promising resource for constructing haplotypes spanning hundreds of kilobases for an individual. In this article, we propose a Markov chain Monte Carlo (MCMC) algorithm, HASH (haplotype assembly for single human), for assembling haplotypes from sequenced DNA fragments that have been mapped to a reference genome assembly. The transitions of the Markov chain are generated using min-cut computations on graphs derived from the sequenced fragments. We have applied our method to infer haplotypes using whole-genome shotgun sequence data from a recently sequenced human individual. The high sequence coverage and presence of mate pairs result in fairly long haplotypes (N50 length ~ 350 kb). Based on comparison of the sequenced fragments against the individual haplotypes, we demonstrate that the haplotypes for this individual inferred using HASH are significantly more accurate than the haplotypes estimated using a previously proposed greedy heuristic and a simple MCMC method. Using haplotypes from the HapMap project, we estimate the switch error rate of the haplotypes inferred using HASH to be quite low, ~1.1%. Our Markov chain Monte Carlo algorithm represents a general framework for haplotype assembly that can be applied to sequence data generated by other sequencing technologies. The code implementing the methods and the phased individual haplotypes can be downloaded from (http://www.cse.ucsd.edu/users/vibansal/HASH/).


Subject(s)
Algorithms , Genome, Human , Genomics/methods , Haplotypes , Markov Chains , Monte Carlo Method , Computer Simulation , Humans
19.
PLoS One ; 3(1): e1456, 2008 Jan 23.
Article in English | MEDLINE | ID: mdl-18213365

ABSTRACT

Viruses are the most abundant biological entities on our planet. Interactions between viruses and their hosts impact several important biological processes in the world's oceans such as horizontal gene transfer, microbial diversity and biogeochemical cycling. Interrogation of microbial metagenomic sequence data collected as part of the Sorcerer II Global Ocean Expedition (GOS) revealed a high abundance of viral sequences, representing approximately 3% of the total predicted proteins. Cluster analyses of the viral sequences revealed hundreds to thousands of viral genes encoding various metabolic and cellular functions. Quantitative analyses of viral genes of host origin performed on the viral fraction of aquatic samples confirmed the viral nature of these sequences and suggested that significant portions of aquatic viral communities behave as reservoirs of such genetic material. Distributional and phylogenetic analyses of these host-derived viral sequences also suggested that viral acquisition of environmentally relevant genes of host origin is a more abundant and widespread phenomenon than previously appreciated. The predominant viral sequences identified within microbial fractions originated from tailed bacteriophages and exhibited varying global distributions according to viral family. Recruitment of GOS viral sequence fragments against 27 complete aquatic viral genomes revealed that only one reference bacteriophage genome was highly abundant and was closely related, but not identical, to the cyanomyovirus P-SSM4. The co-distribution across all sampling sites of P-SSM4-like sequences with the dominant ecotype of its host, Prochlorococcus, supports the classification of the viral sequences as P-SSM4-like and suggests that this virus may influence the abundance, distribution and diversity of one of the most dominant components of picophytoplankton in oligotrophic oceans.
In summary, the abundance and broad geographical distribution of viral sequences within microbial fractions, the prevalence of genes among viral sequences that encode microbial physiological function and their distinct phylogenetic distribution lend strong support to the notion that viral-mediated gene acquisition is a common and ongoing mechanism for generating microbial diversity in the marine environment.


Subject(s)
Genome, Viral , Water Microbiology , Genetic Linkage , Oceans and Seas , Phylogeny
20.
Science ; 314(5807): 1892, 2006 Dec 22.
Article in English | MEDLINE | ID: mdl-17185593

ABSTRACT

Cartilaginous fishes represent the living group of jawed vertebrates that diverged from the common ancestor of human and teleost fish lineages about 530 million years ago. We generated approximately 1.4x genome sequence coverage for a cartilaginous fish, the elephant shark (Callorhinchus milii), and compared this genome with the human genome to identify conserved noncoding elements (CNEs). The elephant shark sequence revealed twice as many CNEs as were identified by whole-genome comparisons between teleost fishes and human. The ancient vertebrate-specific CNEs in the elephant shark and human genomes are likely to play key regulatory roles in vertebrate gene expression.


Subject(s)
Conserved Sequence , Genome, Human , Regulatory Sequences, Nucleic Acid , Sharks/genetics , Animals , Base Sequence , DNA, Intergenic , Enhancer Elements, Genetic , Evolution, Molecular , Genome , Humans , Molecular Sequence Data , Takifugu/genetics , Zebrafish/genetics