Search | VHL Search Portal

Launching genomics into the cloud: deployment of Mercury, a next generation sequence analysis pipeline.

Reid, Jeffrey G; Carroll, Andrew; Veeraraghavan, Narayanan; Dahdouli, Mahmoud; Sundquist, Andreas; English, Adam; Bainbridge, Matthew; White, Simon; Salerno, William; Buhay, Christian; Yu, Fuli; Muzny, Donna; Daly, Richard; Duyk, Geoff; Gibbs, Richard A; Boerwinkle, Eric.

BMC Bioinformatics ; 15: 30, 2014 Jan 29.

Article in English | MEDLINE | ID: mdl-24475911

ABSTRACT

BACKGROUND: Massively parallel DNA sequencing generates staggering amounts of data. Decreasing cost, increasing throughput, and improved annotation have expanded the diversity of genomics applications in research and clinical practice. This expanding scale creates analytical challenges: accommodating peak compute demand, coordinating secure access for multiple analysts, and sharing validated tools and results. RESULTS: To address these challenges, we have developed the Mercury analysis pipeline and deployed it in local hardware and the Amazon Web Services cloud via the DNAnexus platform. Mercury is an automated, flexible, and extensible analysis workflow that provides accurate and reproducible genomic results at scales ranging from individuals to large cohorts. CONCLUSIONS: By taking advantage of cloud computing and with Mercury implemented on the DNAnexus platform, we have demonstrated a powerful combination of a robust and fully validated software pipeline and a scalable computational resource that, to date, we have applied to more than 10,000 whole genome and whole exome samples.

Subject(s)

Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Internet , Software , Genome/genetics , Humans

Reconstruction of genealogical relationships with applications to Phase III of HapMap.

Kyriazopoulou-Panagiotopoulou, Sofia; Kashef Haghighi, Dorna; Aerni, Sarah J; Sundquist, Andreas; Bercovici, Sivan; Batzoglou, Serafim.

Bioinformatics ; 27(13): i333-41, 2011 Jul 01.

Article in English | MEDLINE | ID: mdl-21685089

ABSTRACT

MOTIVATION: Accurate inference of genealogical relationships between pairs of individuals is paramount in association studies, forensics and evolutionary analyses of wildlife populations. Current methods for relationship inference consider only a small set of close relationships and have limited to no power to distinguish between relationships with the same number of meioses separating the individuals under consideration (e.g. aunt-niece versus niece-aunt or first cousins versus great aunt-niece). RESULTS: We present CARROT (ClAssification of Relationships with ROTations), a novel framework for relationship inference that leverages linkage information to differentiate between rotated relationships, that is, between relationships with the same number of common ancestors and the same number of meioses separating the individuals under consideration. We demonstrate that CARROT clearly outperforms existing methods on simulated data. We also applied CARROT on four populations from Phase III of the HapMap Project and detected previously unreported pairs of third- and fourth-degree relatives. AVAILABILITY: Source code for CARROT is freely available at http://carrot.stanford.edu. CONTACT: sofiakp@stanford.edu.

Subject(s)

Algorithms , Genealogy and Heraldry , Animals , Humans , Markov Chains

Bacterial flora-typing with targeted, chip-based Pyrosequencing.

Sundquist, Andreas; Bigdeli, Saharnaz; Jalili, Roxana; Druzin, Maurice L; Waller, Sarah; Pullen, Kristin M; El-Sayed, Yasser Y; Taslimi, M Mark; Batzoglou, Serafim; Ronaghi, Mostafa.

BMC Microbiol ; 7: 108, 2007 Nov 30.

Article in English | MEDLINE | ID: mdl-18047683

ABSTRACT

BACKGROUND: The metagenomic analysis of microbial communities holds the potential to improve our understanding of the role of microbes in clinical conditions. Recent, dramatic improvements in DNA sequencing throughput and cost will enable such analyses on individuals. However, such advances in throughput generally come at the cost of shorter read-lengths, limiting the discriminatory power of each read. In particular, classifying the microbial content of samples by sequencing the < 1,600 bp 16S rRNA gene will be affected by such limitations. RESULTS: We describe a method for identifying the phylogenetic content of bacterial samples using high-throughput Pyrosequencing targeted at the 16S rRNA gene. Our analysis is adapted to the shorter read-lengths of such technology and uses a database of 16S rDNA to determine the most specific phylogenetic classification for reads, resulting in a weighted phylogenetic tree characterizing the content of the sample. We present results for six samples obtained from the human vagina during pregnancy that corroborates previous studies using conventional techniques.Next, we analyze the power of our method to classify reads at each level of the phylogeny using simulation experiments. We assess the impacts of read-length and database completeness on our method, and predict how we do as technology improves and more bacteria are sequenced. Finally, we study the utility of targeting specific 16S variable regions and show that such an approach considerably improves results for certain types of microbial samples. Using simulation, our method can be used to determine the most informative variable region. CONCLUSION: This study provides positive validation of the effectiveness of targeting 16S metagenomes using short-read sequencing technology. Our methodology allows us to infer the most specific assignment of the sequence reads within the phylogeny, and to identify the most discriminative variable region to target. The analysis of high-throughput Pyrosequencing on human flora samples will accelerate the study of the relationship between the microbial world and ourselves.

Subject(s)

Bacterial Typing Techniques/methods , Sequence Analysis/methods , DNA Primers , Female , Humans , Phylogeny , Polymorphism, Genetic , Pregnancy , RNA, Bacterial/genetics , RNA, Ribosomal, 16S/genetics , Vagina/microbiology

Effect of genetic divergence in identifying ancestral origin using HAPAA.

Sundquist, Andreas; Fratkin, Eugene; Do, Chuong B; Batzoglou, Serafim.

Genome Res ; 18(4): 676-82, 2008 Apr.

Article in English | MEDLINE | ID: mdl-18353807

ABSTRACT

The genome of an admixed individual with ancestors from isolated populations is a mosaic of chromosomal blocks, each following the statistical properties of variation seen in those populations. By analyzing polymorphisms in the admixed individual against those seen in representatives from the populations, we can infer the ancestral source of the individual's haploblocks. In this paper we describe a novel approach for ancestry inference, HAPAA (HMM-based analysis of polymorphisms in admixed ancestries), that models the allelic and haplotypic variation in the populations and captures the signal of correlation due to linkage disequilibrium, resulting in greatly improved accuracy. We also introduce a methodology for evaluating the effect of genetic divergence between ancestral populations and time-to-admixture on inference accuracy. Using HAPAA, we explore the limits of ancestry inference in closely related populations.

Subject(s)

Ethnicity/genetics , Genetics, Population/methods , Polymorphism, Genetic , Genome, Human , Humans , Linkage Disequilibrium , Markov Chains

Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data.

Valouev, Anton; Johnson, David S; Sundquist, Andreas; Medina, Catherine; Anton, Elizabeth; Batzoglou, Serafim; Myers, Richard M; Sidow, Arend.

Nat Methods ; 5(9): 829-34, 2008 Sep.

Article in English | MEDLINE | ID: mdl-19160518

ABSTRACT

Molecular interactions between protein complexes and DNA mediate essential gene-regulatory functions. Uncovering such interactions by chromatin immunoprecipitation coupled with massively parallel sequencing (ChIP-Seq) has recently become the focus of intense interest. We here introduce quantitative enrichment of sequence tags (QuEST), a powerful statistical framework based on the kernel density estimation approach, which uses ChIP-Seq data to determine positions where protein complexes contact DNA. Using QuEST, we discovered several thousand binding sites for the human transcription factors SRF, GABP and NRSF at an average resolution of about 20 base pairs. MEME motif-discovery tool-based analyses of the QuEST-identified sequences revealed DNA binding by cofactors of SRF, providing evidence that cofactor binding specificity can be obtained from ChIP-Seq data. By combining QuEST analyses with Gene Ontology (GO) annotations and expression data, we illustrate how general functions of transcription factors can be inferred.

Subject(s)

DNA-Binding Proteins/metabolism , Genomics/methods , Transcription Factors/metabolism , Base Sequence , Binding Sites , Chromatin Immunoprecipitation

Whole-genome sequencing and assembly with high-throughput, short-read technologies.

Sundquist, Andreas; Ronaghi, Mostafa; Tang, Haixu; Pevzner, Pavel; Batzoglou, Serafim.

PLoS One ; 2(5): e484, 2007 May 30.

Article in English | MEDLINE | ID: mdl-17534434

ABSTRACT

While recently developed short-read sequencing technologies may dramatically reduce the sequencing cost and eventually achieve the $1000 goal for re-sequencing, their limitations prevent the de novo sequencing of eukaryotic genomes with the standard shotgun sequencing protocol. We present SHRAP (SHort Read Assembly Protocol), a sequencing protocol and assembly methodology that utilizes high-throughput short-read technologies. We describe a variation on hierarchical sequencing with two crucial differences: (1) we select a clone library from the genome randomly rather than as a tiling path and (2) we sample clones from the genome at high coverage and reads from the clones at low coverage. We assume that 200 bp read lengths with a 1% error rate and inexpensive random fragment cloning on whole mammalian genomes is feasible. Our assembly methodology is based on first ordering the clones and subsequently performing read assembly in three stages: (1) local assemblies of regions significantly smaller than a clone size, (2) clone-sized assemblies of the results of stage 1, and (3) chromosome-sized assemblies. By aggressively localizing the assembly problem during the first stage, our method succeeds in assembling short, unpaired reads sampled from repetitive genomes. We tested our assembler using simulated reads from D. melanogaster and human chromosomes 1, 11, and 21, and produced assemblies with large sets of contiguous sequence and a misassembly rate comparable to other draft assemblies. Tested on D. melanogaster and the entire human genome, our clone-ordering method produces accurate maps, thereby localizing fragment assembly and enabling the parallelization of the subsequent steps of our pipeline. Thus, we have demonstrated that truly inexpensive de novo sequencing of mammalian genomes will soon be possible with high-throughput, short-read technologies using our methodology.

Subject(s)

Genome, Human , Sequence Analysis, DNA , Algorithms , Animals , Chromosome Mapping , Drosophila melanogaster/genetics , Humans

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL