ABSTRACT
Haplotype-resolved or phased genome assembly provides a complete picture of genomes and their complex genetic variations. However, current algorithms for phased assembly either do not generate chromosome-scale phasing or require pedigree information, which limits their application. We present a method named diploid assembly (DipAsm) that uses long, accurate reads and long-range conformation data for single individuals to generate a chromosome-scale phased assembly within 1 day. Applied to four public human genomes, PGP1, HG002, NA12878 and HG00733, DipAsm produced haplotype-resolved assemblies with minimum contig length needed to cover 50% of the known genome (NG50) up to 25 Mb and phased ~99.5% of heterozygous sites at 98-99% accuracy, outperforming other approaches in terms of both contiguity and phasing completeness. We demonstrate the importance of chromosome-scale phased assemblies for the discovery of structural variants (SVs), including thousands of new transposon insertions, and of highly polymorphic and medically important regions such as the human leukocyte antigen (HLA) and killer cell immunoglobulin-like receptor (KIR) regions. DipAsm will facilitate high-quality precision medicine and studies of individual haplotype variation and population diversity.
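The NG50 statistic quoted above can be computed as in the following minimal sketch. The function name `ng50` and the toy contig lengths are illustrative assumptions and are not taken from DipAsm.

```python
def ng50(contig_lengths, genome_size):
    """Return the NG50 of an assembly.

    NG50 is the length of the shortest contig in the smallest set of
    longest contigs that together cover at least half of the known
    genome size (not half of the assembly size, which would be N50).
    """
    half_genome = genome_size / 2
    covered = 0
    # Walk the contigs from longest to shortest, accumulating length.
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= half_genome:
            return length
    # The assembly covers less than half of the genome: NG50 is undefined,
    # reported here as 0 by convention of this sketch.
    return 0


if __name__ == "__main__":
    # Toy example with made-up contig lengths and genome size.
    print(ng50([50, 40, 30, 20, 10], genome_size=200))
```

Because the denominator is the known genome size, an incomplete assembly is penalized, which is why NG50 is preferred over N50 when comparing assemblies of the same genome.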
Subject(s)
Human Chromosomes , Human Genome , Haplotypes , Algorithms , Heterozygote , Humans , Single Nucleotide Polymorphism
ABSTRACT
BACKGROUND: Hereditary cancer screening (HCS) for germline variants in the 3' exons of PMS2, a mismatch repair gene implicated in Lynch syndrome, is technically challenging due to homology with its pseudogene PMS2CL. Sequences of PMS2 and PMS2CL are so similar that next-generation sequencing (NGS) of short fragments-common practice in multigene HCS panels-may identify the presence of a variant but fail to disambiguate whether its origin is the gene or the pseudogene. Molecular approaches utilizing longer DNA fragments, such as long-range PCR (LR-PCR), can definitively localize variants in PMS2, yet applying such testing to all samples can have logistical and economic drawbacks. METHODS: To address these drawbacks, we propose and characterize a reflex workflow for variant discovery in the 3' exons of PMS2. We cataloged the natural variation in PMS2 and PMS2CL in 707 samples and designed hybrid-capture probes to enrich the gene and pseudogene with equal efficiency. For PMS2 exon 11, NGS reads were aligned, filtered using gene-specific variants, and subject to standard diploid variant calling. For PMS2 exons 12-15, the NGS reads were permissively aligned to PMS2, and variant calling was performed with the expectation of observing four alleles (i.e., tetraploid calling). In this reflex workflow, short-read NGS identifies potentially reportable variants that are then subject to disambiguation via LR-PCR-based testing. RESULTS: Applying short-read NGS screening to 299 HCS samples and cell lines demonstrated >99% analytical sensitivity and >99% analytical specificity for single-nucleotide variants (SNVs) and short insertions and deletions (indels), as well as >96% analytical sensitivity and >99% analytical specificity for copy-number variants. Importantly, 92% of samples had resolved genotypes from short-read NGS alone, with the remaining 8% requiring LR-PCR reflex. 
CONCLUSION: Our reflex workflow mitigates the challenges of screening in PMS2 and serves as a guide for clinical laboratories performing multigene HCS. To facilitate future exploration and testing of PMS2 variants, we share the raw and processed LR-PCR data from commercially available cell lines, as well as variant frequencies from a diverse patient cohort.
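The idea of tetraploid calling can be illustrated with a small sketch: when reads from the gene and the pseudogene are aligned together, a site carries four allele copies, so the expected alternate-read fraction is roughly 0, 1/4, 1/2, 3/4 or 1. The code below picks the most likely number of alternate copies under a simple binomial model. The function name, the `error_rate` value and the read counts are assumptions for illustration; this is not the caller used in the study.

```python
from math import log


def call_alt_copies(alt_reads, total_reads, ploidy=4, error_rate=0.01):
    """Return the alt-allele copy number (0..ploidy) with the highest
    binomial log-likelihood given the observed read counts."""
    ref_reads = total_reads - alt_reads
    best_k, best_ll = 0, float("-inf")
    for k in range(ploidy + 1):
        frac = k / ploidy
        # Probability that a read shows the alt allele, allowing for
        # symmetric sequencing error (assumed, hypothetical rate).
        p_alt = frac * (1 - error_rate) + (1 - frac) * error_rate
        # The binomial coefficient is constant across k, so it is omitted.
        ll = alt_reads * log(p_alt) + ref_reads * log(1 - p_alt)
        if ll > best_ll:
            best_k, best_ll = k, ll
    return best_k


if __name__ == "__main__":
    # Made-up read counts: about a quarter of reads carry the variant.
    print(call_alt_copies(alt_reads=25, total_reads=100))
```

A call of one alternate copy out of four says that a variant is present but not whether it lies in PMS2 or PMS2CL, which is the ambiguity the LR-PCR reflex step is meant to resolve.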
Subject(s)
Hereditary Nonpolyposis Colorectal Neoplasms/genetics , Early Detection of Cancer/methods , Mismatch Repair Endonuclease PMS2/genetics , Neoplasm Proteins/genetics , Polymerase Chain Reaction/methods , Pseudogenes , Alleles , Tumor Cell Line , Hereditary Nonpolyposis Colorectal Neoplasms/diagnosis , Hereditary Nonpolyposis Colorectal Neoplasms/metabolism , Early Detection of Cancer/instrumentation , Exons , High-Throughput Nucleotide Sequencing , Humans , Mismatch Repair Endonuclease PMS2/analysis , Mismatch Repair Endonuclease PMS2/metabolism , Neoplasm Proteins/analysis , Neoplasm Proteins/metabolism , Polymerase Chain Reaction/standards , Sensitivity and Specificity
ABSTRACT
BACKGROUND: By identifying pathogenic variants across hundreds of genes, expanded carrier screening (ECS) enables prospective parents to assess the risk of transmitting an autosomal recessive or X-linked condition. Detection of at-risk couples depends on the number of conditions tested, the prevalence of the respective diseases, and the screen's analytical sensitivity for identifying disease-causing variants. Disease-level analytical sensitivity is often <100% in ECS tests because copy number variants (CNVs) are typically not interrogated because of their technical complexity. METHODS: We present an analytical validation and preliminary clinical characterization of a 235-gene sequencing-based ECS with full coverage across coding regions, targeted assessment of pathogenic noncoding variants, panel-wide CNV calling, and specialized assays for technically challenging genes. Next-generation sequencing, customized bioinformatics, and expert manual call review were used to identify single-nucleotide variants, short insertions and deletions, and CNVs for all genes except FMR1 and those whose low disease incidence or high technical complexity precluded novel variant identification or interpretation. RESULTS: Screening of 36859 patients' blood or saliva samples revealed the substantial impact on fetal disease-risk detection attributable to novel CNVs (9.19% of risk) and technically challenging conditions (20.2% of risk), such as congenital adrenal hyperplasia. Of the 7498 couples screened, 335 were identified as at risk for an affected pregnancy, underscoring the clinical importance of the test. Validation of our ECS demonstrated >99% analytical sensitivity and >99% analytical specificity. CONCLUSIONS: Validated high-fidelity identification of different variant types-especially for diseases with complicated molecular genetics-maximizes at-risk couple detection.
Subject(s)
DNA Copy Number Variations , Exons , Genetic Carrier Screening , Cohort Studies , Humans , INDEL Mutation , Single Nucleotide Polymorphism
ABSTRACT
The past two decades have brought many important advances in our understanding of the hereditary susceptibility to cancer. Numerous studies have provided convincing evidence that identification of germline mutations associated with hereditary cancer syndromes can lead to reductions in morbidity and mortality through targeted risk management options. Additionally, advances in gene sequencing technology now permit the development of multigene hereditary cancer testing panels. Here, we describe the 2016 revision of the Counsyl Inherited Cancer Screen for detecting single-nucleotide variants (SNVs), short insertions and deletions (indels), and copy number variants (CNVs) in 36 genes associated with an elevated risk for breast, ovarian, colorectal, gastric, endometrial, pancreatic, thyroid, prostate, melanoma, and neuroendocrine cancers. To determine test accuracy and reproducibility, we performed a rigorous analytical validation across 341 samples, including 118 cell lines and 223 patient samples. The screen achieved 100% test sensitivity across different mutation types, with high specificity and 100% concordance with conventional Sanger sequencing and multiplex ligation-dependent probe amplification (MLPA). We also demonstrated the screen's high intra-run and inter-run reproducibility and robust performance on blood and saliva specimens. Furthermore, we showed that pathogenic Alu element insertions can be accurately detected by our test. Overall, the validation in our clinical laboratory demonstrated the analytical performance required for collecting and reporting genetic information related to risk of developing hereditary cancers.
ABSTRACT
Hereditary breast and ovarian cancer syndrome, caused by a germline pathogenic variant in the BRCA1 or BRCA2 (BRCA1/2) genes, is characterized by an increased risk for breast, ovarian, pancreatic and other cancers. Identification of those who have a BRCA1/2 mutation is important so that they can take advantage of genetic counseling, screening, and potentially life-saving prevention strategies. We describe the design and analytic validation of the Counsyl Inherited Cancer Screen, a next-generation-sequencing-based test to detect pathogenic variation in the BRCA1 and BRCA2 genes. We demonstrate that the test is capable of detecting single-nucleotide variants (SNVs), short insertions and deletions (indels), and copy-number variants (CNVs, also known as large rearrangements) with zero errors over a 114-sample validation set consisting of samples from cell lines and deidentified patient samples, including 36 samples with BRCA1/2 pathogenic germline mutations.
ABSTRACT
We report on results from whole-exome sequencing (WES) of 1,039 subjects diagnosed with autism spectrum disorders (ASD) and 870 controls selected from the NIMH repository to be of similar ancestry to cases. The WES data came from two centers using different methods to produce sequence and to call variants from it. Therefore, an initial goal was to ensure the distribution of rare variation was similar for data from different centers. This proved straightforward by filtering called variants by fraction of missing data, read depth, and balance of alternative to reference reads. Results were evaluated using seven samples sequenced at both centers and by results from the association study. Next we addressed how the data and/or results from the centers should be combined. Gene-based analysis of association was an obvious choice, but should statistics for association be combined across centers (meta-analysis) or should data be combined and then analyzed (mega-analysis)? Because of the nature of many gene-based tests, we showed by theory and simulations that mega-analysis has better power than meta-analysis. Finally, before analyzing the data for association, we explored the impact of population structure on rare variant analysis in these data. Like other recent studies, we found evidence that population structure can confound case-control studies by the clustering of rare variants in ancestry space; yet, unlike some recent studies, for these data we found that principal component-based analyses were sufficient to control for ancestry and produce test statistics with appropriate distributions. After using a variety of gene-based tests and both meta- and mega-analysis, we found no new risk genes for ASD in this sample. Our results suggest that standard gene-based tests will require much larger samples of cases and controls before being effective for gene discovery, even for a disorder like ASD.
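The distinction between meta-analysis and mega-analysis can be made concrete with a small sketch. It uses a deliberately simple stand-in for a gene-based test (a two-proportion z-test on carrier counts) and Stouffer's sample-size-weighted combination for the meta-analysis; both choices, all function names and the counts are illustrative assumptions, not the tests used in the study.

```python
from math import sqrt


def two_proportion_z(case_carriers, n_cases, control_carriers, n_controls):
    """Z statistic comparing rare-variant carrier rates in cases vs controls."""
    p_case = case_carriers / n_cases
    p_ctrl = control_carriers / n_controls
    pooled = (case_carriers + control_carriers) / (n_cases + n_controls)
    se = sqrt(pooled * (1 - pooled) * (1 / n_cases + 1 / n_controls))
    return (p_case - p_ctrl) / se if se > 0 else 0.0


def meta_z(centers):
    """Meta-analysis: test each center separately, then combine the
    per-center Z scores with weights proportional to sqrt(sample size)."""
    zs = [two_proportion_z(*c) for c in centers]
    weights = [sqrt(c[1] + c[3]) for c in centers]
    return sum(w * z for w, z in zip(weights, zs)) / sqrt(sum(w * w for w in weights))


def mega_z(centers):
    """Mega-analysis: pool the raw counts across centers, then test once."""
    totals = [sum(column) for column in zip(*centers)]
    return two_proportion_z(*totals)


if __name__ == "__main__":
    # Hypothetical (case carriers, cases, control carriers, controls) per center.
    centers = [(6, 500, 2, 400), (7, 539, 3, 470)]
    print("meta:", meta_z(centers))
    print("mega:", mega_z(centers))
```

With a single center the two approaches coincide; they diverge only when data from several centers are combined, which is the situation the abstract addresses.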
Subject(s)
Pervasive Child Development Disorders/genetics , Exome , Genome-Wide Association Study , Case-Control Studies , Child , Pervasive Child Development Disorders/physiopathology , Genetic Predisposition to Disease , Genetic Variation , Humans , Population Control , DNA Sequence Analysis , Software
ABSTRACT
Genome-wide association studies have identified hundreds of loci for type 2 diabetes, coronary artery disease and myocardial infarction, as well as for related traits such as body mass index, glucose and insulin levels, lipid levels, and blood pressure. These studies also have pointed to thousands of loci with promising but not yet compelling association evidence. To establish association at additional loci and to characterize the genome-wide significant loci by fine-mapping, we designed the "Metabochip," a custom genotyping array that assays nearly 200,000 SNP markers. Here, we describe the Metabochip and its component SNP sets, evaluate its performance in capturing variation across the allele-frequency spectrum, describe solutions to methodological challenges commonly encountered in its analysis, and evaluate its performance as a platform for genotype imputation. The Metabochip achieves dramatic cost efficiencies compared to designing single-trait follow-up reagents, and provides the opportunity to compare results across a range of related traits. The Metabochip and similar custom genotyping arrays offer a powerful and cost-effective approach to follow up large-scale genotyping and sequencing studies and advance our understanding of the genetic basis of complex human diseases and traits.
Subject(s)
Anthropometry/instrumentation , Metabolomics/instrumentation , Oligonucleotide Array Sequence Analysis/instrumentation , Single Nucleotide Polymorphism , Quantitative Trait Loci , Alleles , Anthropometry/methods , Cardiovascular Diseases/diagnosis , Cardiovascular Diseases/genetics , Cardiovascular Diseases/metabolism , Type 2 Diabetes Mellitus/diagnosis , Type 2 Diabetes Mellitus/genetics , Type 2 Diabetes Mellitus/metabolism , Gene Frequency , Human Genome , Genome-Wide Association Study , Genotype , Genotyping Techniques , Humans , Metabolomics/methods , Oligonucleotide Array Sequence Analysis/methods , Phenotype
ABSTRACT
SUMMARY: zCall is a variant caller specifically designed for calling rare single-nucleotide polymorphisms from array-based technology. This caller is implemented as a post-processing step after a default calling algorithm has been applied. The algorithm uses the intensity profile of the common allele homozygote cluster to define the location of the other two genotype clusters. We demonstrate improved detection of rare alleles when applying zCall to samples that have both Illumina Infinium HumanExome BeadChip and exome sequencing data available. AVAILABILITY: http://atguweb.mgh.harvard.edu/apps/zcall. CONTACT: bneale@broadinstitute.org SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
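One simplified way to read the idea of using the common-allele homozygote cluster to locate the other clusters is sketched below: the spread of the minor-allele intensity within the homozygote cluster defines a noise threshold, and samples exceeding it are candidates for carrying the rare allele. This is not zCall's actual implementation; the z value, the function names, the way the second threshold is supplied and the intensities are all assumptions for illustration.

```python
from statistics import mean, stdev


def noise_threshold(intensities, z=6.0):
    """Threshold set z standard deviations above the cluster mean.
    The default z is a hypothetical choice for this sketch."""
    return mean(intensities) + z * stdev(intensities)


def recall_genotype(x, y, t_x, t_y):
    """Assign a genotype from allele A intensity (x) and allele B intensity (y)
    using one threshold per axis."""
    if x > t_x and y <= t_y:
        return "AA"   # only allele A signal is above noise
    if x <= t_x and y > t_y:
        return "BB"   # only allele B signal is above noise
    if x > t_x and y > t_y:
        return "AB"   # both signals are above noise
    return "NC"       # neither signal is above noise: no call


if __name__ == "__main__":
    # Made-up allele B intensities observed in the AA (common homozygote) cluster.
    aa_cluster_y = [0.020, 0.030, 0.025, 0.035, 0.030]
    t_y = noise_threshold(aa_cluster_y)
    # t_x is simply supplied here; the sketch does not derive it.
    print(recall_genotype(x=1.0, y=0.5, t_x=0.2, t_y=t_y))
```

In this reading, a sample left uncalled by the default algorithm can be reassigned to the heterozygote cluster when its minor-allele intensity clearly exceeds the noise level seen among common-allele homozygotes.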
Subject(s)
Algorithms , Genotyping Techniques , Single Nucleotide Polymorphism , Software , Alleles , Cluster Analysis , Exome , Homozygote , Humans
ABSTRACT
Autism spectrum disorders (ASD) are believed to have genetic and environmental origins, yet in only a modest fraction of individuals can specific causes be identified. To identify further genetic risk factors, here we assess the role of de novo mutations in ASD by sequencing the exomes of ASD cases and their parents (n = 175 trios). Fewer than half of the cases (46.3%) carry a missense or nonsense de novo variant, and the overall rate of mutation is only modestly higher than the expected rate. In contrast, the proteins encoded by genes that harboured de novo missense or nonsense mutations showed a higher degree of connectivity among themselves and to previous ASD genes as indexed by protein-protein interaction screens. The small increase in the rate of de novo events, when taken together with the protein interaction results, is consistent with an important but limited role for de novo point mutations in ASD, similar to that documented for de novo copy number variants. Genetic models incorporating these data indicate that most of the observed de novo events are unconnected to ASD; those that do confer risk are distributed across many genes and are incompletely penetrant (that is, not necessarily sufficient for disease). Our results support polygenic models in which spontaneous coding mutations in any of a large number of genes increase risk by 5- to 20-fold. Despite the challenge posed by such models, results from de novo events and a large parallel case-control study provide strong evidence in favour of CHD8 and KATNAL2 as genuine autism risk factors.
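A comparison of an observed de novo count against an expected rate can be sketched with a one-sided Poisson tail probability, assuming the number of de novo events across trios is modeled as Poisson. The function name and the counts below are hypothetical and do not reproduce the study's data or its exact test.

```python
from math import exp


def poisson_upper_tail(observed, expected):
    """Return P(X >= observed) for X ~ Poisson(expected)."""
    term = exp(-expected)  # P(X = 0)
    cdf_below = 0.0        # accumulates P(X <= observed - 1)
    for k in range(observed):
        cdf_below += term
        term *= expected / (k + 1)  # P(X = k + 1) from P(X = k)
    return 1.0 - cdf_below


if __name__ == "__main__":
    # Hypothetical numbers: 12 events observed where 8.0 are expected.
    print(poisson_upper_tail(observed=12, expected=8.0))
```

A tail probability that is small but not extreme would correspond to the "only modestly higher than expected" pattern described in the abstract.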
Subject(s)
Autistic Disorder/genetics , DNA-Binding Proteins/genetics , Exons/genetics , Genetic Predisposition to Disease/genetics , Mutation/genetics , Transcription Factors/genetics , Case-Control Studies , Exome/genetics , Family Health , Humans , Genetic Models , Multifactorial Inheritance/genetics , Phenotype , Poisson Distribution , Protein Interaction Maps
ABSTRACT
Noncoding variants at human chromosome 9p21 near CDKN2A and CDKN2B are associated with type 2 diabetes, myocardial infarction, aneurysm, vertical cup disc ratio and at least five cancers. Here we compare approaches to more comprehensively assess genetic variation in the region. We carried out targeted sequencing at high coverage in 47 individuals and compared the results to pilot data from the 1000 Genomes Project. We imputed variants into type 2 diabetes and myocardial infarction cohorts directly from targeted sequencing, from a genotyped reference panel derived from sequencing and from 1000 Genomes Project low-coverage data. Polymorphisms with frequency >5% were captured well by all strategies. Imputation of intermediate-frequency polymorphisms required a higher density of tag SNPs in disease samples than is available on first-generation genome-wide association study (GWAS) arrays. Our association analyses identified more comprehensive sets of variants showing equivalent statistical association with type 2 diabetes or myocardial infarction, but did not identify stronger associations than the original GWAS signals.
Subject(s)
Chromosome Mapping , Human Chromosome Pair 9/genetics , Type 2 Diabetes Mellitus/genetics , Genetic Variation/genetics , Genome-Wide Association Study , Myocardial Infarction/genetics , Single Nucleotide Polymorphism/genetics , Human Genome , Haplotypes/genetics , Humans
ABSTRACT
Recent advances in sequencing technology make it possible to comprehensively catalog genetic variation in population samples, creating a foundation for understanding human disease, ancestry and evolution. The amounts of raw data produced are prodigious, and many computational steps are required to translate this output into high-quality variant calls. We present a unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs. Our process includes (i) initial read mapping; (ii) local realignment around indels; (iii) base quality score recalibration; (iv) SNP discovery and genotyping to find all potential variants; and (v) machine learning to separate true segregating variation from machine artifacts common to next-generation sequencing technologies. We here discuss the application of these tools, instantiated in the Genome Analysis Toolkit, to deep whole-genome, whole-exome capture and multi-sample low-pass (~4×) 1000 Genomes Project datasets.
Subject(s)
Genetic Variation , Genotype , DNA Sequence Analysis/methods , Statistical Data Interpretation , Nucleic Acid Databases , Exons , Population Genetics/methods , Population Genetics/statistics & numerical data , Human Genome , Humans , Single Nucleotide Polymorphism , Sequence Alignment/methods , Sequence Alignment/statistics & numerical data , DNA Sequence Analysis/statistics & numerical data , Software
ABSTRACT
Global studies of transcript structure and abundance in cancer cells enable the systematic discovery of aberrations that contribute to carcinogenesis, including gene fusions, alternative splice isoforms, and somatic mutations. We developed a systematic approach to characterize the spectrum of cancer-associated mRNA alterations through integration of transcriptomic and structural genomic data, and we applied this approach to generate new insights into melanoma biology. Using paired-end massively parallel sequencing of cDNA (RNA-seq) together with analyses of high-resolution chromosomal copy number data, we identified 11 novel melanoma gene fusions produced by underlying genomic rearrangements, as well as 12 novel readthrough transcripts. We mapped these chimeric transcripts to base-pair resolution and traced them to their genomic origins using matched chromosomal copy number information. We also used these data to discover and validate base-pair mutations that accumulated in these melanomas, revealing a surprisingly high rate of somatic mutation and lending support to the notion that point mutations constitute the major driver of melanoma progression. Taken together, these results may indicate new avenues for target discovery in melanoma, while also providing a template for large-scale transcriptome studies across many tumor types.
Subject(s)
Gene Expression Profiling , Melanoma/genetics , Skin Neoplasms/genetics , Base Sequence , DNA Mutational Analysis , Gene Amplification , Gene Dosage , Neoplastic Gene Expression Regulation , Gene Fusion , Genomics/methods , Humans , K562 Cells , Matched-Pair Analysis , Melanoma/metabolism , Melanoma/pathology , Genetic Polymorphism , Protein Isoforms/genetics , DNA Sequence Analysis , Skin Neoplasms/metabolism , Skin Neoplasms/pathology , Systems Integration , Cultured Tumor Cells
ABSTRACT
Targeting genomic loci by massively parallel sequencing requires new methods to enrich templates to be sequenced. We developed a capture method that uses biotinylated RNA 'baits' to fish targets out of a 'pond' of DNA fragments. The RNA is transcribed from PCR-amplified oligodeoxynucleotides originally synthesized on a microarray, generating sufficient bait for multiple captures at concentrations high enough to drive the hybridization. We tested this method with 170-mer baits that target >15,000 coding exons (2.5 Mb) and four regions (1.7 Mb total) using Illumina sequencing as read-out. About 90% of uniquely aligning bases fell on or near bait sequence; up to 50% lay on exons proper. The uniformity was such that approximately 60% of target bases in the exonic 'catch', and approximately 80% in the regional catch, had at least half the mean coverage. One lane of Illumina sequence was sufficient to call high-confidence genotypes for 89% of the targeted exon space.
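The uniformity measure quoted above (the share of target bases with at least half the mean coverage) can be computed as in this minimal sketch. The function name and the per-base depths are illustrative assumptions, not data from the study.

```python
def fraction_at_least_half_mean(coverage):
    """Return the fraction of target bases whose read depth is at least
    half of the mean depth across all target bases."""
    if not coverage:
        return 0.0
    half_mean = sum(coverage) / len(coverage) / 2
    passing = sum(1 for depth in coverage if depth >= half_mean)
    return passing / len(coverage)


if __name__ == "__main__":
    # Made-up per-base depths over a tiny target region.
    print(fraction_at_least_half_mean([10, 20, 30, 0, 40]))
```

A value near 1.0 indicates even capture, while lower values mean that a larger share of the target is covered well below average and may not reach the depth needed for confident genotype calls.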