1.
Nat Biotechnol ; 40(6): 932-937, 2022 06.
Article in English | MEDLINE | ID: mdl-35190689

ABSTRACT

Understanding the relationship between amino acid sequence and protein function is a long-standing challenge with far-reaching scientific and translational implications. State-of-the-art alignment-based techniques cannot predict function for one-third of microbial protein sequences, hampering our ability to exploit data from diverse organisms. Here, we train deep learning models to accurately predict functional annotations for unaligned amino acid sequences across rigorous benchmark assessments built from the 17,929 families of the protein families database Pfam. The models infer known patterns of evolutionary substitutions and learn representations that accurately cluster sequences from unseen families. Combining deep models with existing methods significantly improves remote homology detection, suggesting that the deep models learn complementary information. This approach extends the coverage of Pfam by >9.5%, exceeding additions made over the last decade, and predicts function for 360 human reference proteome proteins with no previous Pfam annotation. These results suggest that deep learning models will be a core component of future protein annotation tools.


Subject(s)
Deep Learning , Amino Acid Sequence , Databases, Protein , Humans , Molecular Sequence Annotation , Proteome/metabolism , Proteomics
2.
Nature ; 601(7893): 422-427, 2022 01.
Article in English | MEDLINE | ID: mdl-34987224

ABSTRACT

Maternal morbidity and mortality continue to rise, and pre-eclampsia is a major driver of this burden (ref. 1). Yet the ability to assess underlying pathophysiology before clinical presentation to enable identification of pregnancies at risk remains elusive. Here we demonstrate the ability of plasma cell-free RNA (cfRNA) to reveal patterns of normal pregnancy progression and determine the risk of developing pre-eclampsia months before clinical presentation. Our results centre on comprehensive transcriptome data from eight independent prospectively collected cohorts comprising 1,840 racially diverse pregnancies and retrospective analysis of 2,539 banked plasma samples. The pre-eclampsia data include 524 samples (72 cases and 452 non-cases) from two diverse independent cohorts collected 14.5 weeks (s.d., 4.5 weeks) before delivery. We show that cfRNA signatures from a single blood draw can track pregnancy progression at the placental, maternal and fetal levels and can robustly predict pre-eclampsia, with a sensitivity of 75% and a positive predictive value of 32.3% (s.d., 3%), which is superior to the state-of-the-art method (ref. 2). cfRNA signatures of normal pregnancy progression and pre-eclampsia are independent of clinical factors, such as maternal age, body mass index and race, which cumulatively account for less than 1% of model variance. Further, the cfRNA signature for pre-eclampsia contains gene features linked to biological processes implicated in the underlying pathophysiology of pre-eclampsia.


Subject(s)
Cell-Free Nucleic Acids , Pre-Eclampsia , RNA , Cell-Free Nucleic Acids/blood , Female , Humans , Pre-Eclampsia/diagnosis , Pre-Eclampsia/genetics , Predictive Value of Tests , Pregnancy , RNA/blood , Retrospective Studies , Sensitivity and Specificity
3.
Nat Biotechnol ; 37(10): 1155-1162, 2019 10.
Article in English | MEDLINE | ID: mdl-31406327

ABSTRACT

The DNA sequencing technologies in use today produce either highly accurate short reads or less-accurate long reads. We report the optimization of circular consensus sequencing (CCS) to improve the accuracy of single-molecule real-time (SMRT) sequencing (PacBio) and generate highly accurate (99.8%) long high-fidelity (HiFi) reads with an average length of 13.5 kilobases (kb). We applied our approach to sequence the well-characterized human HG002/NA24385 genome and obtained precision and recall rates of at least 99.91% for single-nucleotide variants (SNVs), 95.98% for insertions and deletions <50 bp (indels) and 95.99% for structural variants. Our CCS method matches or exceeds the ability of short-read sequencing to detect small variants and structural variants. We estimate that 2,434 discordances are correctable mistakes in the 'genome in a bottle' (GIAB) benchmark set. Nearly all (99.64%) variants can be phased into haplotypes, further improving variant detection. De novo genome assembly using CCS reads alone produced a contiguous and accurate genome with a contig N50 of >15 megabases (Mb) and concordance of 99.997%, substantially outperforming assembly with less-accurate long reads.
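
The contig N50 quoted in this abstract is a standard assembly contiguity statistic: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. A minimal sketch of that calculation follows; the function name and the contig lengths are hypothetical and are not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total first reaches half of the
    summed length of all contigs.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths (arbitrary units), for illustration only.
example_contigs = [80, 70, 50, 40, 30, 20]
example_n50 = n50(example_contigs)
```

For the example list the total is 290, so the running sum crosses the halfway point (145) at the second-longest contig.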


Subject(s)
DNA, Circular/genetics , Genome, Human , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Base Sequence , Genetic Variation , Haplotypes , Humans
4.
Bioinformatics ; 35(21): 4389-4391, 2019 11 01.
Article in English | MEDLINE | ID: mdl-30916319

ABSTRACT

SUMMARY: Reference genomes are refined to reflect error corrections and other improvements. While this process improves novel data generation and analysis, incorporating data analyzed on an older reference genome assembly requires transforming the coordinates and representations of the data to the new assembly. Multiple tools exist to perform this transformation for coordinate-only data types, but none supports accurate transformation of genome-wide short variation. Here we present GenomeWarp, a tool for efficiently transforming variants between genome assemblies. GenomeWarp transforms regions and short variants in a conservative manner to minimize false positive and negative variants in the target genome, and converts over 99% of regions and short variants from a representative human genome. AVAILABILITY AND IMPLEMENTATION: GenomeWarp is written in Java. All source code and the user manual are freely available at https://github.com/verilylifesciences/genomewarp. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Genomics , Software , Genome, Human , Humans
5.
Nat Biotechnol ; 36(10): 983-987, 2018 11.
Article in English | MEDLINE | ID: mdl-30247488

ABSTRACT

Despite rapid advances in sequencing technologies, accurately calling genetic variants present in an individual genome from billions of short, errorful sequence reads remains challenging. Here we show that a deep convolutional neural network can call genetic variation in aligned next-generation sequencing read data by learning statistical relationships between images of read pileups around putative variant and true genotype calls. The approach, called DeepVariant, outperforms existing state-of-the-art tools. The learned model generalizes across genome builds and mammalian species, allowing nonhuman sequencing projects to benefit from the wealth of human ground-truth data. We further show that DeepVariant can learn to call variants in a variety of sequencing technologies and experimental designs, including deep whole genomes from 10X Genomics and Ion Ampliseq exomes, highlighting the benefits of using more automated and generalizable techniques for variant calling.


Subject(s)
Genome, Human , Mammals/genetics , Neural Networks, Computer , Polymorphism, Single Nucleotide , Animals , DNA Mutational Analysis , Genomics , Genotype , High-Throughput Nucleotide Sequencing , Humans , INDEL Mutation , Sequence Analysis, DNA , Software
6.
Eur J Hum Genet ; 25(2): 227-233, 2017 02.
Article in English | MEDLINE | ID: mdl-27876817

ABSTRACT

Germline mutation detection from human DNA sequence data is challenging due to the rarity of such events relative to the intrinsic error rates of sequencing technologies and the uneven coverage across the genome. We developed PhaseByTransmission (PBT) to identify de novo single nucleotide variants and short insertions and deletions (indels) from sequence data collected in parent-offspring trios. We compute the joint probability of the data given the genotype likelihoods in the individual family members, the known familial relationships and a prior probability for the mutation rate. Candidate de novo mutations (DNMs) are reported along with their posterior probability, providing a systematic way to prioritize them for validation. Our tool is integrated in the Genome Analysis Toolkit and can be used together with the ReadBackedPhasing module to infer the parental origin of DNMs based on phase-informative reads. Using simulated data, we show that PBT outperforms existing tools, especially in low coverage data and on the X chromosome. We further show that PBT displays high validation rates on empirical parent-offspring sequencing data for whole-exome data from 104 trios and X-chromosome data from 249 parent-offspring families. Finally, we demonstrate an association between father's age at conception and the number of DNMs in female offspring's X chromosome, consistent with previous literature reports.
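
The calculation this abstract describes, combining per-sample genotype likelihoods with the known trio relationship and a mutation-rate prior to obtain a posterior probability of a de novo mutation, can be illustrated with a deliberately simplified sketch. This is not PhaseByTransmission's actual model or code: the biallelic genotype coding (0, 1 or 2 alternate alleles), the uniform spread of the mutation prior over Mendelian-violating child genotypes, and every name and number below are assumptions made for illustration.

```python
from itertools import product

GENOTYPES = (0, 1, 2)  # number of alternate alleles at a biallelic site


def mendelian_prob(mother, father, child):
    """P(child genotype | parental genotypes) with no new mutation."""
    total = 0.0
    for from_mother, from_father in product((0, 1), repeat=2):
        if from_mother + from_father != child:
            continue
        p_m = mother / 2 if from_mother else 1 - mother / 2
        p_f = father / 2 if from_father else 1 - father / 2
        total += p_m * p_f
    return total


def de_novo_posterior(lik_mother, lik_father, lik_child, parent_prior, mu):
    """Posterior probability that the trio carries a Mendelian violation.

    lik_* are genotype likelihoods P(reads | genotype) indexed by genotype;
    parent_prior is a prior over parental genotypes; mu is an assumed
    per-site prior probability of a de novo event.
    """
    de_novo_mass = 0.0
    total_mass = 0.0
    for m, f in product(GENOTYPES, repeat=2):
        violating = [c for c in GENOTYPES if mendelian_prob(m, f, c) == 0]
        for c in GENOTYPES:
            transmission = mendelian_prob(m, f, c)
            if c in violating:
                weight = mu / len(violating)   # simplifying assumption
            elif violating:
                weight = (1 - mu) * transmission
            else:
                weight = transmission          # no detectable violation
            mass = (parent_prior[m] * parent_prior[f] * weight
                    * lik_mother[m] * lik_father[f] * lik_child[c])
            total_mass += mass
            if c in violating:
                de_novo_mass += mass
    return de_novo_mass / total_mass


# Hypothetical inputs: both parents look homozygous reference,
# the child looks heterozygous.
prior = (0.99, 0.0099, 0.0001)
hom_ref = (1.0, 1e-10, 1e-20)
het = (1e-10, 1.0, 1e-10)
posterior = de_novo_posterior(hom_ref, hom_ref, het, prior, mu=1e-6)
```

Ranking candidate sites by such a posterior is what gives the systematic prioritization for validation that the abstract mentions.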


Subject(s)
Genome-Wide Association Study/methods , Germ-Line Mutation , Pedigree , Polymorphism, Single Nucleotide , Sequence Analysis, DNA/methods , Software , Adult , Child , Chromosomes, Human, X/genetics , Exome , Female , Genotype , Humans , Male , Models, Genetic
7.
Nature ; 518(7537): 102-6, 2015 Feb 05.
Article in English | MEDLINE | ID: mdl-25487149

ABSTRACT

Myocardial infarction (MI), a leading cause of death around the world, displays a complex pattern of inheritance. When MI occurs early in life, genetic inheritance is a major component to risk. Previously, rare mutations in low-density lipoprotein (LDL) genes have been shown to contribute to MI risk in individual families, whereas common variants at more than 45 loci have been associated with MI risk in the population. Here we evaluate how rare mutations contribute to early-onset MI risk in the population. We sequenced the protein-coding regions of 9,793 genomes from patients with MI at an early age (≤50 years in males and ≤60 years in females) along with MI-free controls. We identified two genes in which rare coding-sequence mutations were more frequent in MI cases versus controls at exome-wide significance. At low-density lipoprotein receptor (LDLR), carriers of rare non-synonymous mutations were at 4.2-fold increased risk for MI; carriers of null alleles at LDLR were at even higher risk (13-fold difference). Approximately 2% of early MI cases harbour a rare, damaging mutation in LDLR; this estimate is similar to one made more than 40 years ago using an analysis of total cholesterol. Among controls, about 1 in 217 carried an LDLR coding-sequence mutation and had plasma LDL cholesterol > 190 mg dl⁻¹. At apolipoprotein A-V (APOA5), carriers of rare non-synonymous mutations were at 2.2-fold increased risk for MI. When compared with non-carriers, LDLR mutation carriers had higher plasma LDL cholesterol, whereas APOA5 mutation carriers had higher plasma triglycerides. Recent evidence has connected MI risk with coding-sequence mutations at two genes functionally related to APOA5, namely lipoprotein lipase and apolipoprotein C-III (refs 18, 19). Combined, these observations suggest that, as well as LDL cholesterol, disordered metabolism of triglyceride-rich lipoproteins contributes to MI risk.


Subject(s)
Alleles , Apolipoproteins A/genetics , Exome/genetics , Genetic Predisposition to Disease/genetics , Myocardial Infarction/genetics , Receptors, LDL/genetics , Age Factors , Age of Onset , Apolipoprotein A-V , Case-Control Studies , Cholesterol, LDL/blood , Coronary Artery Disease/genetics , Female , Genetics, Population , Heterozygote , Humans , Male , Middle Aged , Mutation/genetics , Myocardial Infarction/blood , National Heart, Lung, and Blood Institute (U.S.) , Triglycerides/blood , United States
8.
Genome Biol ; 15(6): R88, 2014 Jun 30.
Article in English | MEDLINE | ID: mdl-24980144

ABSTRACT

BACKGROUND: Population differentiation has proved to be effective for identifying loci under geographically localized positive selection, and has the potential to identify loci subject to balancing selection. We have previously investigated the pattern of genetic differentiation among human populations at 36.8 million genomic variants to identify sites in the genome showing high frequency differences. Here, we extend this dataset to include additional variants, survey sites with low levels of differentiation, and evaluate the extent to which highly differentiated sites are likely to result from selective or other processes. RESULTS: We demonstrate that while sites with low differentiation represent sampling effects rather than balancing selection, sites showing extremely high population differentiation are enriched for positive selection events and that one half may be the result of classic selective sweeps. Among these, we rediscover known examples, where we actually identify the established functional SNP, and discover novel examples including the genes ABCA12, CALD1 and ZNF804, which we speculate may be linked to adaptations in skin, calcium metabolism and defense, respectively. CONCLUSIONS: We identify known and many novel candidate regions for geographically restricted positive selection, and suggest several directions for further research.


Subject(s)
Genome, Human , INDEL Mutation , Polymorphism, Single Nucleotide , Gene Frequency , Genetic Drift , Humans , Selection, Genetic , Sequence Analysis, DNA
9.
Curr Protoc Bioinformatics ; 43: 11.10.1-11.10.33, 2013.
Article in English | MEDLINE | ID: mdl-25431634

ABSTRACT

This unit describes how to use BWA and the Genome Analysis Toolkit (GATK) to map genome sequencing data to a reference and produce high-quality variant calls that can be used in downstream analyses. The complete workflow includes the core NGS data processing steps that are necessary to make the raw data suitable for analysis by the GATK, as well as the key methods involved in variant discovery using the GATK.


Subject(s)
Genetic Variation , Genome, Human , Software , Calibration , Databases, Genetic , Haploidy , Haplotypes/genetics , Humans , Molecular Sequence Annotation , Polymorphism, Single Nucleotide/genetics , Sequence Alignment
10.
Nature ; 491(7422): 56-65, 2012 Nov 01.
Article in English | MEDLINE | ID: mdl-23128226

ABSTRACT

By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methods to integrate information across several algorithms and diverse data sources, we provide a validated haplotype map of 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions. We show that individuals from different populations carry different profiles of rare and common variants, and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites. This resource, which captures up to 98% of accessible single nucleotide polymorphisms at a frequency of 1% in related populations, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations.


Subject(s)
Genetic Variation/genetics , Genetics, Population , Genome, Human/genetics , Genomics , Alleles , Binding Sites/genetics , Conserved Sequence/genetics , Evolution, Molecular , Genetics, Medical , Genome-Wide Association Study , Haplotypes/genetics , Humans , Nucleotide Motifs , Polymorphism, Single Nucleotide/genetics , Racial Groups/genetics , Sequence Deletion/genetics , Transcription Factors/metabolism
11.
BMC Genomics ; 13: 375, 2012 Aug 05.
Article in English | MEDLINE | ID: mdl-22863213

ABSTRACT

BACKGROUND: Pacific Biosciences technology provides a fundamentally new data type that provides the potential to overcome some limitations of current next generation sequencing platforms by providing significantly longer reads, single molecule sequencing, low composition bias and an error profile that is orthogonal to other platforms. With these potential advantages in mind, we here evaluate the utility of the Pacific Biosciences RS platform for human medical amplicon resequencing projects. RESULTS: We evaluated the Pacific Biosciences technology for SNP discovery in medical resequencing projects using the Genome Analysis Toolkit, observing high sensitivity and specificity for calling differences in amplicons containing known true or false SNPs. We assessed data quality: most errors were indels (~14%) with few apparent miscalls (~1%). In this work, we define a custom data processing pipeline for Pacific Biosciences data for human data analysis. CONCLUSION: Critically, the error properties were largely free of the context-specific effects that affect other sequencing technologies. These data show excellent utility for follow-up validation and extension studies in human data and medical genetics projects, but can be extended to other organisms with a reference genome.


Subject(s)
Sequence Analysis, DNA , Genetic Variation , Genome, Human , Genotype , Humans , Polymorphism, Single Nucleotide , Software , User-Computer Interface
12.
PLoS Comput Biol ; 8(7): e1002604, 2012.
Article in English | MEDLINE | ID: mdl-22807667

ABSTRACT

High coverage whole genome sequencing provides near complete information about genetic variation. However, other technologies can be more efficient in some settings by (a) reducing redundant coverage within samples and (b) exploiting patterns of genetic variation across samples. To characterize as many samples as possible, many genetic studies therefore employ lower coverage sequencing or SNP array genotyping coupled to statistical imputation. To compare these approaches individually and in conjunction, we developed a statistical framework to estimate genotypes jointly from sequence reads, array intensities, and imputation. In European samples, we find similar sensitivity (89%) and specificity (99.6%) from imputation with either 1× sequencing or 1 M SNP arrays. Sensitivity is increased, particularly for low-frequency polymorphisms (MAF < 5%), when low coverage sequence reads are added to dense genome-wide SNP arrays--the converse, however, is not true. At sites where sequence reads and array intensities produce different sample genotypes, joint analysis reduces genotype errors and identifies novel error modes. Our joint framework informs the use of next-generation sequencing in genome wide association studies and supports development of improved methods for genotype calling.


Subject(s)
Genomics/methods , Oligonucleotide Array Sequence Analysis/methods , Polymorphism, Single Nucleotide , Sequence Analysis, DNA/methods , Algorithms , Cluster Analysis , Databases, Genetic , Genome-Wide Association Study , Genotype , Humans , Sensitivity and Specificity , White People
13.
Science ; 335(6070): 823-8, 2012 Feb 17.
Article in English | MEDLINE | ID: mdl-22344438

ABSTRACT

Genome-sequencing studies indicate that all humans carry many genetic variants predicted to cause loss of function (LoF) of protein-coding genes, suggesting unexpected redundancy in the human genome. Here we apply stringent filters to 2951 putative LoF variants obtained from 185 human genomes to determine their true prevalence and properties. We estimate that human genomes typically contain ~100 genuine LoF variants with ~20 genes completely inactivated. We identify rare and likely deleterious LoF alleles, including 26 known and 21 predicted severe disease-causing variants, as well as common LoF variants in nonessential genes. We describe functional and evolutionary differences between LoF-tolerant and recessive disease genes and a method for using these differences to prioritize candidate genes found in clinical sequencing studies.


Subject(s)
Genetic Variation , Genome, Human , Proteins/genetics , Disease/genetics , Gene Expression , Gene Frequency , Humans , Phenotype , Polymorphism, Single Nucleotide , Selection, Genetic
14.
Bioinformatics ; 27(15): 2156-8, 2011 Aug 01.
Article in English | MEDLINE | ID: mdl-21653522

ABSTRACT

SUMMARY: The variant call format (VCF) is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations. VCF is usually stored in a compressed manner and can be indexed for fast data retrieval of variants from a range of positions on the reference genome. The format was developed for the 1000 Genomes Project, and has also been adopted by other projects such as UK10K, dbSNP and the NHLBI Exome Project. VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and also provides a general Perl API. AVAILABILITY: http://vcftools.sourceforge.net
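
To make the format concrete, the sketch below parses the eight fixed, tab-separated columns of a single VCF data line (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). It is a minimal illustration in plain Python, not the VCFtools Perl API; the function name is hypothetical, and per-sample genotype columns, header lines and value typing for INFO fields are deliberately left out.

```python
def parse_vcf_line(line):
    """Parse the eight fixed columns of one VCF data line into a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line has at least 8 tab-separated columns")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]

    # INFO is a semicolon-separated list of KEY=VALUE pairs or bare flags.
    info_dict = {}
    if info != ".":
        for entry in info.split(";"):
            key, sep, value = entry.partition("=")
            info_dict[key] = value if sep else True

    return {
        "chrom": chrom,
        "pos": int(pos),                        # 1-based position
        "id": None if variant_id == "." else variant_id,
        "ref": ref,
        "alt": alt.split(","),                  # comma-separated alternates
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": info_dict,
    }


# Illustrative data line; "." would mark a missing value in any column.
example = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2"
record = parse_vcf_line(example)
```

INFO values are kept as strings here because their types are declared in the file header, which this sketch does not read.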


Subject(s)
Genetic Variation , Genomics/methods , Information Storage and Retrieval/methods , Software , Alleles , Genome, Human , Genotype , Humans
15.
Nat Genet ; 43(7): 712-4, 2011 Jun 12.
Article in English | MEDLINE | ID: mdl-21666693

ABSTRACT

J.B.S. Haldane proposed in 1947 that the male germline may be more mutagenic than the female germline. Diverse studies have supported Haldane's contention of a higher average mutation rate in the male germline in a variety of mammals, including humans. Here we present, to our knowledge, the first direct comparative analysis of male and female germline mutation rates from the complete genome sequences of two parent-offspring trios. Through extensive validation, we identified 49 and 35 germline de novo mutations (DNMs) in two trio offspring, as well as 1,586 non-germline DNMs arising either somatically or in the cell lines from which the DNA was derived. Most strikingly, in one family, we observed that 92% of germline DNMs were from the paternal germline, whereas, in contrast, in the other family, 64% of DNMs were from the maternal germline. These observations suggest considerable variation in mutation rates within and between families.


Subject(s)
Family , Genetic Variation , Genome, Human , Germ-Line Mutation/genetics , Chromosome Mapping , DNA Mutational Analysis , Female , Humans , Male , Polymerase Chain Reaction
16.
Nat Genet ; 43(5): 491-8, 2011 May.
Article in English | MEDLINE | ID: mdl-21478889

ABSTRACT

Recent advances in sequencing technology make it possible to comprehensively catalog genetic variation in population samples, creating a foundation for understanding human disease, ancestry and evolution. The amounts of raw data produced are prodigious, and many computational steps are required to translate this output into high-quality variant calls. We present a unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs. Our process includes (i) initial read mapping; (ii) local realignment around indels; (iii) base quality score recalibration; (iv) SNP discovery and genotyping to find all potential variants; and (v) machine learning to separate true segregating variation from machine artifacts common to next-generation sequencing technologies. We here discuss the application of these tools, instantiated in the Genome Analysis Toolkit, to deep whole-genome, whole-exome capture and multi-sample low-pass (∼4×) 1000 Genomes Project datasets.


Subject(s)
Genetic Variation , Genotype , Sequence Analysis, DNA/methods , Data Interpretation, Statistical , Databases, Nucleic Acid , Exons , Genetics, Population/methods , Genetics, Population/statistics & numerical data , Genome, Human , Humans , Polymorphism, Single Nucleotide , Sequence Alignment/methods , Sequence Alignment/statistics & numerical data , Sequence Analysis, DNA/statistics & numerical data , Software
17.
Hum Mol Genet ; 20(7): 1285-9, 2011 Apr 01.
Article in English | MEDLINE | ID: mdl-21212097

ABSTRACT

Exome sequencing is a powerful tool for discovery of the Mendelian disease genes. Previously, we reported a novel locus for autosomal recessive non-syndromic mental retardation (NSMR) in a consanguineous family [Nolan, D.K., Chen, P., Das, S., Ober, C. and Waggoner, D. (2008) Fine mapping of a locus for nonsyndromic mental retardation on chromosome 19p13. Am. J. Med. Genet. A, 146A, 1414-1422]. Using linkage and homozygosity mapping, we previously localized the gene to chromosome 19p13. The parents of this sibship were recently included in an exome sequencing project. Using a series of filters, we narrowed the putative causal mutation to a single variant site that segregated with NSMR: the mutation was homozygous in five affected siblings but in none of eight unaffected siblings. This mutation causes a substitution of a leucine for a highly conserved proline at amino acid 182 in TECR (trans-2,3-enoyl-CoA reductase), a synaptic glycoprotein. Our results reveal the value of massively parallel sequencing for identification of novel disease genes that could not be found using traditional approaches and identifies only the seventh causal mutation for autosomal recessive NSMR.


Subject(s)
Chromosomes, Human, Pair 19/genetics , Genetic Diseases, Inborn/genetics , Intellectual Disability/genetics , Membrane Glycoproteins/genetics , Mutation , Oxidoreductases/genetics , Synaptic Membranes/genetics , Female , Genetic Diseases, Inborn/enzymology , Humans , Intellectual Disability/enzymology , Male , Membrane Glycoproteins/metabolism , Oxidoreductases/metabolism , Pedigree , Synaptic Membranes/enzymology
18.
BMC Genomics ; 12: 42, 2011 Jan 18.
Article in English | MEDLINE | ID: mdl-21244689

ABSTRACT

BACKGROUND: Comprehensive sequence characterization across the MHC is important for successful organ transplantation and genetic association studies. To this end, we have developed an automated sample preparation, molecular barcoding and multiplexing protocol for the amplification and sequence-determination of class I HLA loci. We have coupled this process to a novel HLA calling algorithm to determine the most likely pair of alleles at each locus. RESULTS: We have benchmarked our protocol with 270 HapMap individuals from four worldwide populations with 96.4% accuracy at 4-digit resolution. A variation of this initial protocol, more suitable for large sample sizes, in which molecular barcodes are added during PCR rather than library construction, was tested on 95 HapMap individuals with 98.6% accuracy at 4-digit resolution. CONCLUSIONS: Next-generation sequencing on the 454 FLX Titanium platform is a reliable, efficient, and scalable technology for HLA typing.


Subject(s)
Genes, MHC Class I/genetics , Histocompatibility Testing/methods , Sequence Analysis, DNA/methods , Humans , Polymerase Chain Reaction
19.
N Engl J Med ; 363(23): 2220-7, 2010 Dec 02.
Article in English | MEDLINE | ID: mdl-20942659

ABSTRACT

We sequenced all protein-coding regions of the genome (the "exome") in two family members with combined hypolipidemia, marked by extremely low plasma levels of low-density lipoprotein (LDL) cholesterol, high-density lipoprotein (HDL) cholesterol, and triglycerides. These two participants were compound heterozygotes for two distinct nonsense mutations in ANGPTL3 (encoding the angiopoietin-like 3 protein). ANGPTL3 has been reported to inhibit lipoprotein lipase and endothelial lipase, thereby increasing plasma triglyceride and HDL cholesterol levels in rodents. Our finding of ANGPTL3 mutations highlights a role for the gene in LDL cholesterol metabolism in humans and shows the usefulness of exome sequencing for identification of novel genetic causes of inherited disorders. (Funded by the National Human Genome Research Institute and others.).


Subject(s)
Angiopoietins/genetics , Codon, Nonsense , Hypobetalipoproteinemias/genetics , Angiopoietin-Like Protein 3 , Angiopoietin-like Proteins , Cholesterol, HDL/blood , Cholesterol, HDL/genetics , Cholesterol, LDL/blood , Cholesterol, LDL/genetics , DNA Mutational Analysis , Female , Genetic Linkage , Humans , Male , Pedigree
20.
Genome Res ; 20(9): 1297-303, 2010 Sep.
Article in English | MEDLINE | ID: mdl-20644199

ABSTRACT

Next-generation DNA sequencing (NGS) projects, such as the 1000 Genomes Project, are already revolutionizing our understanding of genetic variation among individuals. However, the massive data sets generated by NGS--the 1000 Genome pilot alone includes nearly five terabases--make writing feature-rich, efficient, and robust analysis tools difficult for even computationally sophisticated individuals. Indeed, many professionals are limited in the scope and the ease with which they can answer scientific questions by the complexity of accessing and manipulating the data produced by these machines. Here, we discuss our Genome Analysis Toolkit (GATK), a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce. The GATK provides a small but rich set of data access patterns that encompass the majority of analysis tool needs. Separating specific analysis calculations from common data management infrastructure enables us to optimize the GATK framework for correctness, stability, and CPU and memory efficiency and to enable distributed and shared memory parallelization. We highlight the capabilities of the GATK by describing the implementation and application of robust, scale-tolerant tools like coverage calculators and single nucleotide polymorphism (SNP) calling. We conclude that the GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
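
The MapReduce idea behind the coverage calculators mentioned in this abstract can be sketched in a few lines: a map step computes a value at each locus independently, and a reduce step folds those values into one result. The example below is a toy illustration in Python, not the GATK's Java framework or its walker interface; the read intervals and function names are hypothetical.

```python
from functools import reduce


def depth_at(locus, reads):
    """Map step: number of reads overlapping one locus.

    Each read is a half-open interval (start, end) on a single contig.
    """
    return sum(1 for start, end in reads if start <= locus < end)


def mean_coverage(reads, region_start, region_end):
    """Map every locus in the region to its depth, then reduce by summing."""
    depths = map(lambda locus: depth_at(locus, reads),
                 range(region_start, region_end))
    total = reduce(lambda acc, depth: acc + depth, depths, 0)
    return total / (region_end - region_start)


# Hypothetical aligned reads covering positions 0..7.
reads = [(0, 5), (3, 8), (4, 6)]
coverage = mean_coverage(reads, 0, 8)
```

Because each locus is processed independently in the map step, the region can be split into shards and the partial sums combined afterwards, which is the property that makes this pattern easy to parallelize.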


Subject(s)
Genome , Genomics/methods , Sequence Analysis, DNA/methods , Software , Base Sequence