Abstract
[This corrects the article DOI: 10.1371/journal.pgen.1000832.]
Abstract
The seminal importance of DNA sequencing to the life sciences, biotechnology and medicine has driven the search for more scalable and lower-cost solutions. Here we describe a DNA sequencing technology in which scalable, low-cost semiconductor manufacturing techniques are used to make an integrated circuit able to directly perform non-optical DNA sequencing of genomes. Sequence data are obtained by directly sensing the ions produced by template-directed DNA polymerase synthesis using all-natural nucleotides on this massively parallel semiconductor-sensing device or ion chip. The ion chip contains ion-sensitive, field-effect transistor-based sensors in perfect register with 1.2 million wells, which provide confinement and allow parallel, simultaneous detection of independent sequencing reactions. Use of the most widely used technology for constructing integrated circuits, the complementary metal-oxide semiconductor (CMOS) process, allows for low-cost, large-scale production and scaling of the device to higher densities and larger array sizes. We show the performance of the system by sequencing three bacterial genomes, its robustness and scalability by producing ion chips with up to 10 times as many sensors and sequencing a human genome.
Subject(s)
Bacterial Genome/genetics, Human Genome/genetics, Genomics/instrumentation, Genomics/methods, Semiconductors, DNA Sequence Analysis/instrumentation, DNA Sequence Analysis/methods, Escherichia coli/genetics, Humans, Light, Male, Rhodopseudomonas/genetics, Vibrio/genetics
Abstract
U87MG is a commonly studied grade IV glioma cell line that has been analyzed in at least 1,700 publications over four decades. In order to comprehensively characterize the genome of this cell line and to serve as a model of broad cancer genome sequencing, we have generated greater than 30x genomic sequence coverage using a novel 50-base mate-paired strategy with a 1.4 kb mean insert library. A total of 1,014,984,286 mate-end and 120,691,623 single-end two-base encoded reads were generated from five slides. All data were aligned using a custom-designed tool called BFAST, allowing optimal color space read alignment and accurate identification of DNA variants. The aligned sequence reads and mate-pair information identified 35 interchromosomal translocation events, 1,315 structural variations (>100 bp), 191,743 small (<21 bp) insertions and deletions (indels), and 2,384,470 single nucleotide variations (SNVs). Among these observations, the known homozygous mutation in PTEN was robustly identified, and genes involved in cell adhesion were overrepresented in the mutated gene list. Data were compared to 219,187 heterozygous single nucleotide polymorphisms assayed by Illumina 1M Duo genotyping array to assess accuracy: 93.83% of all SNPs were reliably detected at filtering thresholds that yield greater than 99.99% sequence accuracy. Protein coding sequences were disrupted predominantly in this cancer cell line due to small indels, large deletions, and translocations. In total, 512 genes were homozygously mutated, including 154 by SNVs, 178 by small indels, 145 by large microdeletions, and 35 by interchromosomal translocations to reveal a highly mutated cell line genome. Of the small homozygously mutated variants, 8 SNVs and 99 indels were novel events not present in dbSNP. These data demonstrate that routine generation of broad cancer genome sequence is possible outside of genome centers. The sequence analysis of U87MG provides an unparalleled level of mutational resolution compared to any cell line to date.
Subject(s)
Tumor Cell Line/chemistry, Human Genome, Glioma/genetics, Tumor Cell Line/cytology, Genotype, Humans, Molecular Sequence Data, Mutation, Single Nucleotide Polymorphism, Proteins/genetics, DNA Sequence Analysis
Abstract
Otosclerosis is a common form of progressive hearing loss, characterized by abnormal bone remodeling in the otic capsule. The etiology of the disease is largely unknown, and both environmental and genetic factors have been implicated. To identify genetic factors involved in otosclerosis, we used a case-control discovery group to complete a genome-wide association (GWA) study with 555,000 single-nucleotide polymorphisms (SNPs), utilizing pooled DNA samples. By individual genotyping of the top 250 SNPs in a stepwise strategy, we were able to identify two highly associated SNPs that replicated in two additional independent populations. We then genotyped 79 tagSNPs to fine map the two genomic regions defined by the associated SNPs. The region with the strongest association signal, p(combined) = 6.23 x 10^-10, is on chromosome 7q22.1 and spans intron 1 to intron 4 of reelin (RELN), a gene known for its role in neuronal migration. Evidence for allelic heterogeneity was found in this region. Consistent with the GWA data, expression of RELN was confirmed in the inner ear and in stapes footplate specimens. In conclusion, we provide evidence that implicates RELN in the pathogenesis of otosclerosis.
Subject(s)
Neuronal Cell Adhesion Molecules/genetics, Extracellular Matrix Proteins/genetics, Human Genome, Nerve Tissue Proteins/genetics, Otosclerosis/genetics, Serine Endopeptidases/genetics, Case-Control Studies, Neuronal Cell Adhesion Molecules/biosynthesis, Inner Ear/metabolism, Extracellular Matrix Proteins/biosynthesis, Female, Genome-Wide Association Study, Humans, Male, Nerve Tissue Proteins/biosynthesis, Otosclerosis/metabolism, Single Nucleotide Polymorphism, Reelin Protein, Serine Endopeptidases/biosynthesis
Abstract
Rapidly evolving sequencing technologies produce data on an unparalleled scale. A central challenge to the analysis of this data is sequence alignment, whereby sequence reads must be compared to a reference. A wide variety of alignment algorithms and software have been subsequently developed over the past two years. In this article, we will systematically review the current development of these algorithms and introduce their practical applications on different types of experimental data. We come to the conclusion that short-read alignment is no longer the bottleneck of data analyses. We also consider future development of alignment algorithms with respect to emerging long sequence reads and the prospect of cloud computing.
Subject(s)
Algorithms, Base Sequence, Sequence Alignment/methods, DNA Sequence Analysis/methods, Human Genome, Humans, Software
Abstract
Forensic genetic genealogy (FGG) has primarily relied upon dense single nucleotide polymorphism (SNP) profiles from forensic samples or unidentified human remains queried against online genealogy database(s) of known profiles generated with SNP microarrays or from whole genome sequencing (WGS). In these queries, SNPs are compared to database samples by locating contiguous stretches of shared SNP alleles that allow for detection of genomic segments that are identical by descent (IBD) among biological relatives (kinship). This segment-based approach, while robust for detecting distant relationships, generally requires DNA quantity and/or quality that are sometimes not available in forensic casework samples. By focusing on SNPs with maximal discriminatory power and using an algorithm designed for a sparser SNP set than those from microarray typing, performance similar to segment matching was reached even in difficult casework samples. This algorithm locates shared segments using kinship coefficients in "windows" across the genome. The windowed kinship algorithm is a modification of the PC-AiR and PC-Relate tools for genetic relatedness inference, referred to here as the "whole genome kinship" approach, that control for the presence of unknown or unspecified population substructure. Simulated and empirical data in this study, using DNA profiles comprised of 10,230 SNPs (10K multiplex) targeted by the ForenSeq™ Kintelligence Kit demonstrate that the windowed kinship approach performs comparably to segment matching for identifying first, second and third degree relationships, reasonably well for fourth degree relationships, and with fewer false kinship associations. Selection criteria for the 10K SNP PCR-based multiplex and functionality of the windowed kinship algorithm are described.
Subject(s)
DNA Fingerprinting, Single Nucleotide Polymorphism, Humans, Pedigree, Alleles, Polymerase Chain Reaction
Abstract
We developed a generalized framework for multiplexed resequencing of targeted human genome regions on the Illumina Genome Analyzer using degenerate indexed DNA bar codes ligated to fragmented DNA before sequencing. Using this method, we simultaneously sequenced the DNA of multiple HapMap individuals at several Encyclopedia of DNA Elements (ENCODE) regions. We then evaluated the use of Bayes factors for discovering and genotyping polymorphisms. For polymorphisms that were either previously identified within the Single Nucleotide Polymorphism database (dbSNP) or visually evident upon re-inspection of archived ENCODE traces, we observed a false positive rate of 11.3% using strict thresholds for predicting variants and 69.6% for lax thresholds. Conversely, false negative rates were 10.8-90.8%, with false negatives at stricter cut-offs occurring at lower coverage (<10 aligned reads). These results suggest that >90% of genetic variants are discoverable using multiplexed sequencing provided sufficient coverage at the polymorphic base.
Subject(s)
Electronic Data Processing, Genetic Variation, Human Genome, Humans, Single Nucleotide Polymorphism, Sequence Alignment
Abstract
We use high-density single nucleotide polymorphism (SNP) genotyping microarrays to demonstrate the ability to accurately and robustly determine whether individuals are in a complex genomic DNA mixture. We first develop a theoretical framework for detecting an individual's presence within a mixture, then show, through simulations, the limits associated with our method, and finally demonstrate experimentally the identification of the presence of genomic DNA of specific individuals within a series of highly complex genomic mixtures, including mixtures where an individual contributes less than 0.1% of the total genomic DNA. These findings shift the perceived utility of SNPs for identifying individual trace contributors within a forensics mixture, and suggest future research efforts into assessing the viability of previously sub-optimal DNA sources due to sample contamination. These findings also suggest that composite statistics across cohorts, such as allele frequency or genotype counts, do not mask identity within genome-wide association studies. The implications of these findings are discussed.
Subject(s)
Medical Genetics, Oligonucleotide Array Sequence Analysis/methods, Single Nucleotide Polymorphism, Computer Simulation, Human Genome, Genotype, Humans
Abstract
BACKGROUND: DNA sequence comparison is a well-studied problem, in which two DNA sequences are compared using a weighted edit distance. Recent DNA sequencing technologies, however, observe an encoded form of the sequence, rather than each DNA base individually. The encoded DNA sequence may contain technical errors, and therefore encoded sequencing errors must be incorporated when comparing an encoded DNA sequence to a reference DNA sequence. RESULTS: Although two-base encoding is currently used in practice, many other encoding schemes are possible, whereby two or more bases are encoded at a time. A generalized k-base encoding scheme is presented, whereby feasible higher order encodings are better able to differentiate errors in the encoded sequence from true DNA sequence variants. A generalized version of the previous two-base encoding DNA sequence comparison algorithm is used to compare a k-base encoded sequence to a DNA reference sequence. Finally, simulations are performed to evaluate the power, the false positive and false negative SNP discovery rates, and the performance time of k-base encoding compared to previous methods as well as to the standard DNA sequence comparison algorithm. CONCLUSIONS: The novel generalized k-base encoding scheme and resulting local alignment algorithm permit the development of higher fidelity ligation-based next generation sequencing technology. This bioinformatic solution affords greater robustness to errors, as well as lower false SNP discovery rates, only at the cost of computational time.
Subject(s)
Algorithms, Sequence Alignment/methods, DNA Sequence Analysis/methods, Computer Simulation, DNA/genetics
Abstract
SUMMARY: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments. AVAILABILITY: http://samtools.sourceforge.net.
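As an illustration of the format described above, the following is a minimal sketch of reading one SAM alignment line in Python. It assumes the eleven tab-separated mandatory fields defined in the SAM specification, followed by optional TAG:TYPE:VALUE fields. The `SamRecord` class and the helper names are hypothetical and are not part of SAMtools; the example read is invented.

```python
from dataclasses import dataclass, field


@dataclass
class SamRecord:
    """One alignment line: the 11 mandatory SAM fields plus optional tags."""
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: dict = field(default_factory=dict)


def parse_sam_line(line: str) -> SamRecord:
    """Split a non-header SAM line into typed fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    tags = {}
    for item in cols[11:]:
        name, typ, value = item.split(":", 2)
        # Only integer tags are converted here; other types stay as strings.
        tags[name] = int(value) if typ == "i" else value
    return SamRecord(cols[0], int(cols[1]), cols[2], int(cols[3]),
                     int(cols[4]), cols[5], cols[6], int(cols[7]),
                     int(cols[8]), cols[9], cols[10], tags)


def is_reverse_strand(rec: SamRecord) -> bool:
    """FLAG bit 0x10 marks a read aligned to the reverse strand."""
    return bool(rec.flag & 0x10)


# Invented example read, for illustration only.
example = "read1\t16\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII\tNM:i:1"
record = parse_sam_line(example)
print(record.rname, record.pos, is_reverse_strand(record), record.tags)
```

Production work would normally rely on SAMtools itself or an established binding rather than hand-written parsing; the sketch only shows how the fields are laid out.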
Subject(s)
Computational Biology/methods, Sequence Alignment/methods, DNA Sequence Analysis/methods, Software, Algorithms, Base Sequence, Genome, Genomics, Molecular Sequence Data
Abstract
BACKGROUND: DNA sequence comparison is based on optimal local alignment of two sequences using a similarity score. However, some new DNA sequencing technologies do not directly measure the base sequence, but rather an encoded form, such as the two-base encoding considered here. In order to compare such data to a reference sequence, the data must be decoded into sequence. The decoding is deterministic, but the possibility of measurement errors requires searching among all possible error modes and resulting alignments to achieve an optimal balance of fewer errors versus greater sequence similarity. RESULTS: We present an extension of the standard dynamic programming method for local alignment, which simultaneously decodes the data and performs the alignment, maximizing a similarity score based on a weighted combination of errors and edits, and allowing an affine gap penalty. We also present simulations that demonstrate the performance characteristics of our two base encoded alignment method and contrast those with standard DNA sequence alignment under the same conditions. CONCLUSION: The new local alignment algorithm for two-base encoded data has substantial power to properly detect and correct measurement errors while identifying underlying sequence variants, and facilitating genome re-sequencing efforts based on this form of sequence data.
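To make the idea of two-base encoding concrete, here is a minimal Python sketch. It assumes the commonly described color-space scheme in which each color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3); the abstract does not spell out the code table, so this mapping is an assumption, and the function names are hypothetical. The sketch covers only the deterministic encode/decode step, not the combined decoding and alignment algorithm of the paper.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode(seq):
    """Return (first_base, colors); each color covers two adjacent bases."""
    codes = [BASE_TO_BITS[b] for b in seq]
    colors = [a ^ b for a, b in zip(codes, codes[1:])]
    return seq[0], colors


def decode(first_base, colors):
    """Deterministically rebuild the bases from a known first base."""
    code = BASE_TO_BITS[first_base]
    bases = [first_base]
    for color in colors:
        code ^= color  # next base = previous base XOR color
        bases.append(BITS_TO_BASE[code])
    return "".join(bases)


first, colors = encode("ACGTT")
print(colors)                       # [1, 3, 1, 0]
print(decode(first, colors))        # ACGTT

# One miscalled color corrupts every downstream base after naive decoding.
print(decode("A", [1, 2, 1, 0]))    # ACTGG

# A true single-base variant (G -> A at the third base) changes two adjacent colors.
print(encode("ACATT")[1])           # [1, 1, 3, 0]
```

The last two cases show why errors and variants can be told apart: an isolated color change is more likely a measurement error, while a real substitution alters two neighboring colors in a consistent way.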
Subject(s)
Algorithms, DNA/chemistry, Sequence Alignment/methods, DNA Sequence Analysis/methods, Base Sequence
Abstract
BACKGROUND: The emergence of next-generation sequencing technology presents tremendous opportunities to accelerate the discovery of rare variants or mutations that underlie human genetic disorders. Although the complete sequencing of the affected individuals' genomes would be the most powerful approach to finding such variants, the cost of such efforts makes it impractical for routine use in disease gene research. In cases where candidate genes or loci can be defined by linkage, association, or phenotypic studies, the practical sequencing target can be made much smaller than the whole genome, and it becomes critical to have capture methods that can be used to purify the desired portion of the genome for shotgun short-read sequencing without biasing allelic representation or coverage. One major approach is array-based capture, which relies on the ability to create a custom in-situ synthesized oligonucleotide microarray for use as a collection of hybridization capture probes. This approach is being used by our group and others routinely and we are continuing to improve its performance. RESULTS: Here, we provide a complete protocol optimized for large aggregate sequence intervals and demonstrate its utility with the capture of all predicted amino acid coding sequence from 3,038 human genes using 241,700 60-mer oligonucleotides. Further, we demonstrate two techniques by which the efficiency of the capture can be increased: by introducing a step to block cross hybridization mediated by common adapter sequences used in sequencing library construction, and by repeating the hybridization capture step. These improvements can boost the targeting efficiency to the point where over 85% of the mapped sequence reads fall within 100 bases of the targeted regions. CONCLUSIONS: The complete protocol introduced in this paper enables researchers to perform practical capture experiments, and includes two novel methods for increasing the targeting efficiency. Coupled with the new massively parallel sequencing technologies, this provides a powerful approach to identifying disease-causing genetic variants that can be localized within the genome by traditional methods.
Subject(s)
Genetic Loci, Human Genome, Oligonucleotide Array Sequence Analysis/methods, DNA Sequence Analysis/methods, Neoplasm DNA/genetics, Neoplasm Genes, Genomic Library, Humans, Sequence Alignment
Abstract
For many genome-wide association (GWA) studies individually genotyping one million or more SNPs provides a marginal increase in coverage at a substantial cost. Much of the information gained is redundant due to the correlation structure inherent in the human genome. Pooling-based GWA studies could benefit significantly by utilizing this redundancy to reduce noise, improve the accuracy of the observations and increase genomic coverage. We introduce a measure of correlation between individual genotyping and pooling, under the same framework that r^2 provides a measure of linkage disequilibrium (LD) between pairs of SNPs. We then report a new non-haplotype multimarker multi-loci method that leverages the correlation structure between SNPs in the human genome to increase the efficacy of pooling-based GWA studies. We first give a theoretical framework and derivation of our multimarker method. Next, we evaluate simulations using this multimarker approach in comparison to single marker analysis. Finally, we experimentally evaluate our method using different pools of HapMap individuals on the Illumina 450S Duo, Illumina 550K and Affymetrix 5.0 platforms for a combined total of 1,333,631 SNPs. Our results show that use of multimarker analysis reduces noise specific to pooling-based studies, allows for efficient integration of multiple microarray platforms and provides more accurate measures of significance than single marker analysis. Additionally, this approach can be extended to allow for imputing the association significance for SNPs not directly observed using neighboring SNPs in LD. This multimarker method can now be used to cost-effectively complete pooling-based GWA studies with multiple platforms across over one million SNPs and to impute neighboring SNPs weighted for the loss of information due to pooling.
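For readers unfamiliar with the LD measure referred to above, the following Python sketch computes the standard pairwise r^2 between two biallelic SNPs from phased haplotypes, using r^2 = D^2 / (pA(1 - pA) pB(1 - pB)) with D = pAB - pA pB. It illustrates only this textbook definition, not the pooling correlation measure or the multimarker method introduced in the paper; the function name and the toy haplotypes are hypothetical.

```python
def ld_r_squared(haplotypes):
    """Pairwise LD r^2 from phased haplotypes.

    haplotypes: list of (a, b) tuples, where a and b are 1 if the haplotype
    carries the chosen allele at SNP 1 and SNP 2 respectively, else 0.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n       # allele frequency at SNP 1
    p_b = sum(b for _, b in haplotypes) / n       # allele frequency at SNP 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                          # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when either SNP is monomorphic")
    return d * d / denom


# Toy data: perfectly correlated SNPs, then independent SNPs.
print(ld_r_squared([(1, 1), (1, 1), (0, 0), (0, 0)]))  # 1.0
print(ld_r_squared([(1, 1), (1, 0), (0, 1), (0, 0)]))  # 0.0
```

When r^2 is close to 1, one SNP carries nearly the same information as the other, which is the redundancy the abstract proposes to exploit.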
Subject(s)
Chromosome Mapping/methods, Genetic Linkage/genetics, Genetic Markers/genetics, Haplotypes/genetics, Linkage Disequilibrium/genetics, Single Nucleotide Polymorphism/genetics, DNA Sequence Analysis/methods, Base Sequence, DNA Mutational Analysis/methods, Molecular Sequence Data
Abstract
BACKGROUND: There is tremendous potential for genome sequencing to improve clinical diagnosis and care once it becomes routinely accessible, but this will require formalizing research methods into clinical best practices in the areas of sequence data generation, analysis, interpretation and reporting. The CLARITY Challenge was designed to spur convergence in methods for diagnosing genetic disease starting from clinical case history and genome sequencing data. DNA samples were obtained from three families with heritable genetic disorders and genomic sequence data were donated by sequencing platform vendors. The challenge was to analyze and interpret these data with the goals of identifying disease-causing variants and reporting the findings in a clinically useful format. Participating contestant groups were solicited broadly, and an independent panel of judges evaluated their performance. RESULTS: A total of 30 international groups were engaged. The entries reveal a general convergence of practices on most elements of the analysis and interpretation process. However, even given this commonality of approach, only two groups identified the consensus candidate variants in all disease cases, demonstrating a need for consistent fine-tuning of the generally accepted methods. There was greater diversity of the final clinical report content and in the patient consenting process, demonstrating that these areas require additional exploration and standardization. CONCLUSIONS: The CLARITY Challenge provides a comprehensive assessment of current practices for using genome sequencing to diagnose and report genetic diseases. There is remarkable convergence in bioinformatic techniques, but medical interpretation and reporting are areas that require further development by many groups.
Subject(s)
Genetic Databases/standards, Genetic Testing/methods, Genomics/methods, Research Peer Review, DNA Sequence Analysis/methods, Child, Female, Organized Financing, Genetic Testing/economics, Genetic Testing/standards, Genomics/economics, Genomics/standards, Congenital Heart Defects/diagnosis, Congenital Heart Defects/genetics, Humans, Male, Congenital Structural Myopathies/diagnosis, Congenital Structural Myopathies/genetics, DNA Sequence Analysis/economics, DNA Sequence Analysis/standards
Abstract
A primary component of next-generation sequencing analysis is to align short reads to a reference genome, with each read aligned independently. However, reads that observe the same non-reference DNA sequence are highly correlated and can be used to better model the true variation in the target genome. A novel short-read micro realigner, SRMA, that leverages this correlation to better resolve a consensus of the underlying DNA sequence of the targeted genome is described here.
Subject(s)
Algorithms, Computational Biology/methods, Human Genome, Sequence Alignment/methods, DNA Sequence Analysis/methods, Base Sequence, Tumor Cell Line, Gene Frequency, Humans, Oligonucleotide Array Sequence Analysis, Single Nucleotide Polymorphism
Abstract
BACKGROUND: The new generation of massively parallel DNA sequencers, combined with the challenge of whole human genome resequencing, results in the need for rapid and accurate alignment of billions of short DNA sequence reads to a large reference genome. Speed is obviously of great importance, but equally important is maintaining alignment accuracy of short reads, in the 25-100 base range, in the presence of errors and true biological variation. METHODOLOGY: We introduce a new algorithm specifically optimized for this task, as well as a freely available implementation, BFAST, which can align data produced by any of the current sequencing platforms, allows for user-customizable levels of speed and accuracy, supports paired end data, and provides for efficient parallel and multi-threaded computation on a computer cluster. The new method is based on creating flexible, efficient whole genome indexes to rapidly map reads to candidate alignment locations, with arbitrary multiple independent indexes allowed to achieve robustness against read errors and sequence variants. The final local alignment uses a Smith-Waterman method, with gaps to support the detection of small indels. CONCLUSIONS: We compare BFAST to a selection of large-scale alignment tools -- BLAT, MAQ, SHRiMP, and SOAP -- in terms of both speed and accuracy, using simulated and real-world datasets. We show BFAST can achieve substantially greater sensitivity of alignment in the context of errors and true variants, especially insertions and deletions, and minimize false mappings, while maintaining adequate speed compared to other current methods. We show BFAST can align the amount of data needed to fully resequence a human genome, one billion reads, with high sensitivity and accuracy, on a modest computer cluster in less than 24 hours. BFAST is available at (http://bfast.sourceforge.net).
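The final gapped local alignment step mentioned above can be illustrated with a textbook Smith-Waterman sketch in Python. This is not BFAST's implementation: it uses a simple linear gap penalty, and the scoring values (match 2, mismatch -1, gap -2), the function name, and the example sequences are all assumptions chosen for illustration.

```python
def smith_waterman(read, ref, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_read, aligned_ref, ref_start) for the best local alignment."""
    rows, cols = len(read) + 1, len(ref) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the dynamic programming matrix; cells never drop below zero.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if read[i - 1] == ref[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,  # match or mismatch
                              score[i - 1][j] + gap,      # gap in the reference
                              score[i][j - 1] + gap)      # gap in the read
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero-scoring cell is reached.
    i, j = best_pos
    aligned_read, aligned_ref = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if read[i - 1] == ref[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            aligned_read.append(read[i - 1])
            aligned_ref.append(ref[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned_read.append(read[i - 1])
            aligned_ref.append("-")
            i -= 1
        else:
            aligned_read.append("-")
            aligned_ref.append(ref[j - 1])
            j -= 1
    return best, "".join(reversed(aligned_read)), "".join(reversed(aligned_ref)), j


# A read carrying a one-base deletion relative to the reference.
print(smith_waterman("AAACCCGGG", "AAACCCTGGG"))
# (16, 'AAACCC-GGG', 'AAACCCTGGG', 0)
```

The matrix costs time and memory proportional to read length times reference length, which is why an aligner would apply such a step only to the short candidate locations returned by an index rather than to the whole genome.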
Subject(s)
Human Genome, Sequence Alignment/methods, Algorithms, Computational Biology/methods, Computer Simulation, Genetic Techniques, Genetic Variation, Humans, Reproducibility of Results, DNA Sequence Analysis, Software
Abstract
As a first step in analyzing high-throughput data in genome-wide studies, several algorithms are available to identify and prioritize candidate lists for downstream fine-mapping. The prioritized candidates could be differentially expressed genes, aberrations in comparative genomic hybridization studies, or single nucleotide polymorphisms (SNPs) in association studies. Different analysis algorithms are subject to various experimental artifacts and analytical features that lead to different candidate lists. However, little research has been carried out to theoretically quantify the consensus between different candidate lists and to compare the study-specific accuracy of the analytical methods based on a known reference candidate list. Within the context of genome-wide studies, we propose a generic mathematical framework to statistically compare ranked lists of candidates from different algorithms with each other or, if available, with a reference candidate list. To cope with the growing need for intuitive visualization of high-throughput data in genome-wide studies, we describe a complementary customizable visualization tool. As a case study, we demonstrate application of our framework to the comparison and visualization of candidate lists generated in a DNA-pooling based genome-wide association study of CEPH data in the HapMap project, where prior knowledge from individual genotyping can be used to generate a true reference candidate list. The results provide a theoretical basis to compare the accuracy of various methods and to identify redundant methods, thus providing guidance for selecting the most suitable analysis method in genome-wide studies.
Subject(s)
Algorithms, Genome-Wide Association Study/methods, Statistical Models, Alleles, Case-Control Studies, DNA/genetics, Genetic Predisposition to Disease, Haplotypes, Humans, Oligonucleotide Array Sequence Analysis
Abstract
We conducted a genome-wide association pooling study for cutaneous melanoma and performed validation in samples totaling 2,019 cases and 2,105 controls. Using pooling, we identified a new melanoma risk locus on chromosome 20 (rs910873 and rs1885120), with replication in two further samples (combined P < 1 x 10^-15). The per allele odds ratio was 1.75 (1.53, 2.01), with evidence for stronger association in early-onset cases.
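An odds ratio reported with an interval, as above, can be illustrated with a short Python sketch that derives an allelic odds ratio and a Wald 95% confidence interval from a 2x2 table of allele counts. The abstract does not state how its estimate was obtained (it may well come from a regression model), so this is only one common textbook calculation; the function name and the counts are hypothetical and are not the study's data.

```python
import math


def odds_ratio_ci(risk_cases, other_cases, risk_controls, other_controls, z=1.96):
    """Allelic odds ratio with a Wald confidence interval (z=1.96 for 95%)."""
    odds_ratio = (risk_cases * other_controls) / (other_cases * risk_controls)
    # Standard error of log(OR) from the four cell counts.
    se = math.sqrt(1 / risk_cases + 1 / other_cases
                   + 1 / risk_controls + 1 / other_controls)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Hypothetical allele counts, for illustration only.
or_, lower, upper = odds_ratio_ci(20, 10, 10, 20)
print(f"OR = {or_:.2f} ({lower:.2f}, {upper:.2f})")
```

Because the interval is symmetric on the log scale, the product of its two limits equals the square of the odds ratio, which is a quick sanity check on reported values.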
Subject(s)
Human Chromosomes, Pair 20, Genetic Predisposition to Disease, Melanoma/genetics, Single Nucleotide Polymorphism, Skin Neoplasms/genetics, Adult, Age of Onset, Case-Control Studies, Humans, Linkage Disequilibrium, Odds Ratio
Abstract
We report the development and validation of experimental methods, study designs, and analysis software for pooling-based genomewide association (GWA) studies that use high-throughput single-nucleotide-polymorphism (SNP) genotyping microarrays. We first describe a theoretical framework for establishing the effectiveness of pooling genomic DNA as a low-cost alternative to individually genotyping thousands of samples on high-density SNP microarrays. Next, we describe software called "GenePool," which directly analyzes SNP microarray probe intensity data and ranks SNPs by increased likelihood of being genetically associated with a trait or disorder. Finally, we apply these methods to experimental case-control data and demonstrate successful identification of published genetic susceptibility loci for a rare monogenic disease (sudden infant death with dysgenesis of the testes syndrome), a rare complex disease (progressive supranuclear palsy), and a common complex disease (Alzheimer disease) across multiple SNP genotyping platforms. On the basis of these theoretical calculations and their experimental validation, our results suggest that pooling-based GWA studies are a logical first step for determining whether major genetic associations exist in diseases with high heritability.