ABSTRACT
Around 60% of individuals with neurodevelopmental disorders (NDD) remain undiagnosed after comprehensive genetic testing, primarily of protein-coding genes1. Large genome-sequenced cohorts are improving our ability to discover new diagnoses in the non-coding genome. Here we identify the non-coding RNA RNU4-2 as a syndromic NDD gene. RNU4-2 encodes the U4 small nuclear RNA (snRNA), which is a critical component of the U4/U6.U5 tri-snRNP complex of the major spliceosome2. We identify an 18 base pair region of RNU4-2 mapping to two structural elements in the U4/U6 snRNA duplex (the T-loop and stem III) that is severely depleted of variation in the general population, but in which we identify heterozygous variants in 115 individuals with NDD. Most individuals (77.4%) have the same highly recurrent single base insertion (n.64_65insT). In 54 individuals in whom it could be determined, the de novo variants were all on the maternal allele. We demonstrate that RNU4-2 is highly expressed in the developing human brain, in contrast to RNU4-1 and other U4 homologues. Using RNA sequencing, we show how 5' splice-site use is systematically disrupted in individuals with RNU4-2 variants, consistent with the known role of this region during spliceosome activation. Finally, we estimate that variants in this 18 base pair region explain 0.4% of individuals with NDD. This work underscores the importance of non-coding genes in rare disorders and will provide a diagnosis to thousands of individuals with NDD worldwide.
Subject(s)
Mutation , Neurodevelopmental Disorders , Small Nuclear RNA , Adolescent , Child , Preschool Child , Female , Humans , Infant , Male , Young Adult , Alleles , Brain/growth & development , Brain/metabolism , Heterozygote , Neurodevelopmental Disorders/genetics , RNA Splice Sites/genetics , Small Nuclear RNA/genetics , Spliceosomes/genetics , Syndrome , Rare Diseases/genetics , Developmental Gene Expression Regulation
ABSTRACT
Mutation in the germline generates all evolutionary genetic variation and is a cause of genetic disease. Parental age is the primary determinant of the number of new germline mutations in an individual's genome1,2. Here we analysed the genome-wide sequences of 21,879 families with rare genetic diseases and identified 12 individuals with a hypermutated genome with between two and seven times more de novo single-nucleotide variants than expected. In most families (9 out of 12), the excess mutations came from the father. Two families had genetic drivers of germline hypermutation, with fathers carrying damaging genetic variation in DNA-repair genes. For five of the families, paternal exposure to chemotherapeutic agents before conception was probably a key driver of hypermutation. Our results suggest that the germline is well protected from mutagenic effects, hypermutation is rare, the number of excess mutations is relatively modest and most individuals with a hypermutated genome will not have a genetic disease.
Subject(s)
Inborn Genetic Diseases , Germ Cells , Germ-Line Mutation , Age Factors , Inborn Genetic Diseases/genetics , Germ-Line Mutation/genetics , Humans , Male , Mutagenesis/genetics , Mutation , Parents , Single Nucleotide Polymorphism
ABSTRACT
BACKGROUND: Pediatric disorders include a range of highly penetrant, genetically heterogeneous conditions amenable to genomewide diagnostic approaches. Finding a molecular diagnosis is challenging but can have profound lifelong benefits. METHODS: We conducted a large-scale sequencing study involving more than 13,500 families with probands with severe, probably monogenic, difficult-to-diagnose developmental disorders from 24 regional genetics services in the United Kingdom and Ireland. Standardized phenotypic data were collected, and exome sequencing and microarray analyses were performed to investigate novel genetic causes. We developed an iterative variant analysis pipeline and reported candidate variants to clinical teams for validation and diagnostic interpretation to inform communication with families. Multiple regression analyses were performed to evaluate factors affecting the probability of diagnosis. RESULTS: A total of 13,449 probands were included in the analyses. On average, we reported 1.0 candidate variant per parent-offspring trio and 2.5 variants per singleton proband. Using clinical and computational approaches to variant classification, we made a diagnosis in approximately 41% of probands (5502 of 13,449). Of 3599 probands in trios who received a diagnosis by clinical assertion, approximately 76% had a pathogenic de novo variant. Another 22% of probands (2997 of 13,449) had variants of uncertain significance in genes that were strongly linked to monogenic developmental disorders. Recruitment in a parent-offspring trio had the largest effect on the probability of diagnosis (odds ratio, 4.70; 95% confidence interval [CI], 4.16 to 5.31). 
Probands were less likely to receive a diagnosis if they were born extremely prematurely (i.e., 22 to 27 weeks' gestation; odds ratio, 0.39; 95% CI, 0.22 to 0.68), had in utero exposure to antiepileptic medications (odds ratio, 0.44; 95% CI, 0.29 to 0.67), had mothers with diabetes (odds ratio, 0.52; 95% CI, 0.41 to 0.67), or were of African ancestry (odds ratio, 0.51; 95% CI, 0.31 to 0.78). CONCLUSIONS: Among probands with severe, probably monogenic, difficult-to-diagnose developmental disorders, multimodal analysis of genomewide data had good diagnostic power, even after previous attempts at diagnosis. (Funded by the Health Innovation Challenge Fund and Wellcome Sanger Institute.)
Subject(s)
Genomics , Rare Diseases , Child , Humans , Exome , Ireland/epidemiology , United Kingdom/epidemiology , Rare Diseases/diagnosis , Rare Diseases/epidemiology , Rare Diseases/genetics , Oligonucleotide Array Sequence Analysis , Genetic Association Studies , Neurodevelopmental Disorders/diagnosis , Neurodevelopmental Disorders/genetics , Congenital Abnormalities/diagnosis , Congenital Abnormalities/genetics , Growth Disorders/diagnosis , Growth Disorders/genetics , Facies , Child Behavior Disorders/diagnosis , Child Behavior Disorders/genetics , Inborn Genetic Diseases/diagnosis , Inborn Genetic Diseases/genetics
ABSTRACT
De novo mutations in protein-coding genes are a well-established cause of developmental disorders1. However, genes known to be associated with developmental disorders account for only a minority of the observed excess of such de novo mutations1,2. Here, to identify previously undescribed genes associated with developmental disorders, we integrate healthcare and research exome-sequence data from 31,058 parent-offspring trios of individuals with developmental disorders, and develop a simulation-based statistical test to identify gene-specific enrichment of de novo mutations. We identified 285 genes that were significantly associated with developmental disorders, including 28 that had not previously been robustly associated with developmental disorders. Although we detected more genes associated with developmental disorders, much of the excess of de novo mutations in protein-coding genes remains unaccounted for. Modelling suggests that more than 1,000 genes associated with developmental disorders have not yet been described, many of which are likely to be less penetrant than the currently known genes. Research access to clinical diagnostic datasets will be critical for completing the map of genes associated with developmental disorders.
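The idea of a simulation-based enrichment test can be sketched in a few lines. The example below is a simplified, unweighted illustration and not the test used in the study: it assumes a hypothetical per-chromosome mutation rate for a gene, draws null de novo counts from a Poisson distribution, and reports an empirical p-value. The function names, the rate and the number of simulations are all assumptions made for illustration; only the cohort size of 31,058 trios comes from the abstract.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) value with Knuth's method (suitable for small lam)."""
    threshold = math.exp(-lam)
    count, running_product = 0, rng.random()
    while running_product > threshold:
        count += 1
        running_product *= rng.random()
    return count


def simulated_enrichment_p(observed, n_trios, per_chrom_rate, n_sim=100_000, seed=0):
    """Empirical p-value for seeing >= `observed` de novo mutations in one gene.

    Assumption: under the null, the count is Poisson with mean
    2 * n_trios * per_chrom_rate (two transmitted chromosomes per child).
    """
    rng = random.Random(seed)
    expected = 2 * n_trios * per_chrom_rate
    hits = sum(sample_poisson(expected, rng) >= observed for _ in range(n_sim))
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_sim + 1)


if __name__ == "__main__":
    # Hypothetical gene: rate 1e-5 per chromosome, 8 de novo mutations observed.
    p = simulated_enrichment_p(observed=8, n_trios=31_058, per_chrom_rate=1e-5)
    print(f"empirical p-value: {p:.2e}")
```

A real analysis would also correct for the number of genes tested and would typically weight mutations by predicted severity; this sketch does neither.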
Subject(s)
DNA Mutational Analysis , Data Analysis , Genetic Databases , Datasets as Topic , Delivery of Health Care/statistics & numerical data , Developmental Disabilities/genetics , Inborn Genetic Diseases/genetics , Cohort Studies , DNA Copy Number Variations/genetics , Developmental Disabilities/diagnosis , Europe , Female , Inborn Genetic Diseases/diagnosis , Germ-Line Mutation/genetics , Haploinsufficiency/genetics , Humans , Male , Missense Mutation/genetics , Penetrance , Perinatal Death , Sample Size
ABSTRACT
Structural variation (SV) describes a broad class of genetic variation greater than 50 bp in size. SVs can cause a wide range of genetic diseases and are prevalent in rare developmental disorders (DDs). Individuals presenting with DDs are often referred for diagnostic testing with chromosomal microarrays (CMAs) to identify large copy-number variants (CNVs) and/or with single-gene, gene-panel, or exome sequencing (ES) to identify single-nucleotide variants, small insertions/deletions, and CNVs. However, individuals with pathogenic SVs undetectable by conventional analysis often remain undiagnosed. Consequently, we have developed the tool InDelible, which interrogates short-read sequencing data for split-read clusters characteristic of SV breakpoints. We applied InDelible to 13,438 probands with severe DDs recruited as part of the Deciphering Developmental Disorders (DDD) study and discovered 63 rare, damaging variants in genes previously associated with DDs missed by standard SNV, indel, or CNV discovery approaches. Clinical review of these 63 variants determined that about half (30/63) were plausibly pathogenic. InDelible was particularly effective at ascertaining variants between 21 and 500 bp in size and increased the total number of potentially pathogenic variants identified by DDD in this size range by 42.9%. Of particular interest were seven confirmed de novo variants in MECP2, which represent 35.0% of all de novo protein-truncating variants in MECP2 among DDD study participants. InDelible provides a framework for the discovery of pathogenic SVs that are most likely missed by standard analytical workflows and has the potential to improve the diagnostic yield of ES across a broad range of genetic diseases.
Subject(s)
Developmental Disabilities/diagnosis , Developmental Disabilities/genetics , Exome Sequencing/methods , Child , Female , Humans , Male , Methyl-CpG-Binding Protein 2/genetics
ABSTRACT
Clinical genetic testing of protein-coding regions identifies a likely causative variant in only around half of developmental disorder (DD) cases. The contribution of regulatory variation in non-coding regions to rare disease, including DD, remains very poorly understood. We screened 9,858 probands from the Deciphering Developmental Disorders (DDD) study for de novo mutations in the 5' untranslated regions (5' UTRs) of genes within which variants have previously been shown to cause DD through a dominant haploinsufficient mechanism. We identified four single-nucleotide variants and two copy-number variants upstream of MEF2C in a total of ten individual probands. We developed multiple bespoke and orthogonal experimental approaches to demonstrate that these variants cause DD through three distinct loss-of-function mechanisms, disrupting transcription, translation, and/or protein function. These non-coding region variants represent 23% of likely diagnoses identified in MEF2C in the DDD cohort, but these would all be missed in standard clinical genetics approaches. Nonetheless, these variants are readily detectable in exome sequence data, with 30.7% of 5' UTR bases across all genes well covered in the DDD dataset. Our analyses show that non-coding variants upstream of genes within which coding variants are known to cause DD are an important cause of severe disease and demonstrate that analyzing 5' UTRs can increase diagnostic yield. We also show how non-coding variants can help inform both the disease-causing mechanism underlying protein-coding variants and dosage tolerance of the gene.
Subject(s)
5' Untranslated Regions , Developmental Disabilities/etiology , Genetic Predisposition to Disease , Loss of Function Mutation , Child , Cohort Studies , DNA Copy Number Variations , Developmental Disabilities/pathology , Humans , MEF2 Transcription Factors/genetics , Exome Sequencing
ABSTRACT
Bulk and single-cell DNA sequencing has enabled reconstructing clonal substructures of somatic tissues from frequency and cooccurrence patterns of somatic variants. However, approaches to characterize phenotypic variations between clones are not established. Here we present cardelino (https://github.com/single-cell-genetics/cardelino), a computational method for inferring the clonal tree configuration and the clone of origin of individual cells assayed using single-cell RNA-seq (scRNA-seq). Cardelino flexibly integrates information from imperfect clonal trees inferred based on bulk exome-seq data, and sparse variant alleles expressed in scRNA-seq data. We apply cardelino to a published cancer dataset and to newly generated matched scRNA-seq and exome-seq data from 32 human dermal fibroblast lines, identifying hundreds of differentially expressed genes between cells from different somatic clones. These genes are frequently enriched for cell cycle and proliferation pathways, indicating a role for cell division genes in somatic evolution in healthy skin.
Subject(s)
Fibroblasts/metabolism , Gene Expression Profiling/methods , RNA Sequence Analysis/methods , Single-Cell Analysis/methods , Software , Algorithms , Cell Cycle , Cell Proliferation , Humans , Melanoma , Mutation , Transcriptome
ABSTRACT
This corrects the article DOI: 10.1038/nature22403.
ABSTRACT
Technology utilizing human induced pluripotent stem cells (iPS cells) has enormous potential to provide improved cellular models of human disease. However, variable genetic and phenotypic characterization of many existing iPS cell lines limits their potential use for research and therapy. Here we describe the systematic generation, genotyping and phenotyping of 711 iPS cell lines derived from 301 healthy individuals by the Human Induced Pluripotent Stem Cells Initiative. Our study outlines the major sources of genetic and phenotypic variation in iPS cells and establishes their suitability as models of complex human traits and cancer. Through genome-wide profiling we find that 5-46% of the variation in different iPS cell phenotypes, including differentiation capacity and cellular morphology, arises from differences between individuals. Additionally, we assess the phenotypic consequences of genomic copy-number alterations that are repeatedly observed in iPS cells. Finally, we present a comprehensive map of common regulatory variants affecting the transcriptome of human pluripotent cells.
Subject(s)
Genetic Variation/genetics , Induced Pluripotent Stem Cells/metabolism , Cultured Cells , Cellular Reprogramming/genetics , DNA Copy Number Variations/genetics , Gene Expression Regulation/genetics , Genotype , Humans , Organ Specificity , Phenotype , Quality Control , Quantitative Trait Loci/genetics , Transcriptome/genetics
ABSTRACT
The contribution of rare and low-frequency variants to human traits is largely unexplored. Here we describe insights from sequencing whole genomes (low read depth, 7×) or exomes (high read depth, 80×) of nearly 10,000 individuals from population-based and disease collections. In extensively phenotyped cohorts we characterize over 24 million novel sequence variants, generate a highly accurate imputation reference panel and identify novel alleles associated with levels of triglycerides (APOB), adiponectin (ADIPOQ) and low-density lipoprotein cholesterol (LDLR and RGAG1) from single-marker and rare variant aggregation tests. We describe population structure and functional annotation of rare and low-frequency variants, use the data to estimate the benefits of sequencing for association studies, and summarize lessons from disease-specific collections. Finally, we make available an extensive resource, including individual-level genetic and phenotypic data and web-based tools to facilitate the exploration of association results.
Subject(s)
Disease/genetics , Genetic Variation/genetics , Human Genome/genetics , Health , Adiponectin/blood , Alleles , Cohort Studies , Exome/genetics , Female , Genetic Predisposition to Disease/genetics , Medical Genetics , Population Genetics , Genome-Wide Association Study , Genomics , Humans , Lipid Metabolism/genetics , Male , Molecular Sequence Annotation , LDL Receptors/genetics , Reference Standards , DNA Sequence Analysis , Triglycerides/blood , United Kingdom
ABSTRACT
The extent to which low-frequency (minor allele frequency (MAF) between 1-5%) and rare (MAF ≤ 1%) variants contribute to complex traits and disease in the general population is mainly unknown. Bone mineral density (BMD) is highly heritable, a major predictor of osteoporotic fractures, and has been previously associated with common genetic variants, as well as rare, population-specific, coding variants. Here we identify novel non-coding genetic variants with large effects on BMD (n_total = 53,236) and fracture (n_total = 508,253) in individuals of European ancestry from the general population. Associations for BMD were derived from whole-genome sequencing (n = 2,882 from UK10K (ref. 10); a population-based genome sequencing consortium), whole-exome sequencing (n = 3,549), deep imputation of genotyped samples using a combined UK10K/1000 Genomes reference panel (n = 26,534), and de novo replication genotyping (n = 20,271). We identified a low-frequency non-coding variant near a novel locus, EN1, with an effect size fourfold larger than the mean of previously reported common variants for lumbar spine BMD (rs11692564(T), MAF = 1.6%, replication effect size = +0.20 s.d., P_meta = 2 × 10^-14), which was also associated with a decreased risk of fracture (odds ratio = 0.85; P = 2 × 10^-11; n_cases = 98,742 and n_controls = 409,511). Using an En1^cre/flox mouse model, we observed that conditional loss of En1 results in low bone mass, probably as a consequence of high bone turnover. We also identified a novel low-frequency non-coding variant with large effects on BMD near WNT16 (rs148771817(T), MAF = 1.2%, replication effect size = +0.41 s.d., P_meta = 1 × 10^-11). In general, there was an excess of association signals arising from deleterious coding and conserved non-coding variants. 
These findings provide evidence that low-frequency non-coding variants have large effects on BMD and fracture, thereby providing rationale for whole-genome sequencing and improved imputation reference panels to study the genetic architecture of complex traits and disease in the general population.
Subject(s)
Bone Density/genetics , Bone Fractures/genetics , Human Genome/genetics , Homeodomain Proteins/genetics , Animals , Bone and Bones/metabolism , Animal Disease Models , Europe/ethnology , Exome/genetics , Female , Gene Frequency/genetics , Genetic Predisposition to Disease/genetics , Genetic Variation/genetics , Genomics , Genotype , Humans , Mice , DNA Sequence Analysis , White People/genetics , Wnt Proteins/genetics
ABSTRACT
The Canaanites inhabited the Levant region during the Bronze Age and established a culture that became influential in the Near East and beyond. However, the Canaanites, unlike most other ancient Near Easterners of this period, left few surviving textual records and thus their origin and relationship to ancient and present-day populations remain unclear. In this study, we sequenced five whole genomes from ~3,700-year-old individuals from the city of Sidon, a major Canaanite city-state on the Eastern Mediterranean coast. We also sequenced the genomes of 99 individuals from present-day Lebanon to catalog modern Levantine genetic diversity. We find that a Bronze Age Canaanite-related ancestry was widespread in the region, shared among urban populations inhabiting the coast (Sidon) and inland populations (Jordan) who likely lived in farming societies or were pastoral nomads. This Canaanite-related ancestry derived from mixture between local Neolithic populations and eastern migrants genetically related to Chalcolithic Iranians. We estimate, using linkage-disequilibrium decay patterns, that admixture occurred 6,600-3,550 years ago, coinciding with recorded massive population movements in Mesopotamia during the mid-Holocene. We show that present-day Lebanese derive most of their ancestry from a Canaanite-related population, which therefore implies substantial genetic continuity in the Levant since at least the Bronze Age. In addition, we find Eurasian ancestry in the Lebanese not present in Bronze Age or earlier Levantines. We estimate that this Eurasian ancestry arrived in the Levant around 3,750-2,170 years ago during a period of successive conquests by distant populations.
Subject(s)
Mitochondrial DNA/genetics , Ethnicity/genetics , Population Genetics/methods , Human Genome/genetics , Genetic Variation/genetics , Ancient History , Humans , Lebanon , Linkage Disequilibrium , Male , White People/genetics
ABSTRACT
Deep sequence-based imputation can enhance the discovery power of genome-wide association studies by assessing previously unexplored variation across the common- and low-frequency spectra. We applied a hybrid whole-genome sequencing (WGS) and deep imputation approach to examine the broader allelic architecture of 12 anthropometric traits associated with height, body mass, and fat distribution in up to 267,616 individuals. We report 106 genome-wide significant signals that have not been previously identified, including 9 low-frequency variants pointing to functional candidates. Of the 106 signals, 6 are in genomic regions that have not been implicated with related traits before, 28 are independent signals at previously reported regions, and 72 represent previously reported signals for a different anthropometric trait. 71% of signals reside within genes and fine mapping resolves 23 signals to one or two likely causal variants. We confirm genetic overlap between human monogenic and polygenic anthropometric traits and find signal enrichment in cis expression QTLs in relevant tissues. Our results highlight the potential of WGS strategies to enhance biologically relevant discoveries across the frequency spectrum.
Subject(s)
Anthropometry , Human Genome , Genome-Wide Association Study , Quantitative Trait Loci/genetics , DNA Sequence Analysis/methods , Body Height/genetics , Cohort Studies , DNA Methylation/genetics , Genetic Databases , Female , Genetic Variation , Humans , Lipodystrophy/genetics , Male , Meta-Analysis as Topic , Obesity/genetics , Physical Chromosome Mapping , Sex Characteristics , Syndrome , United Kingdom
ABSTRACT
MOTIVATION: Very low-depth sequencing has been proposed as a cost-effective approach to capture low-frequency and rare variation in complex trait association studies. However, a full characterization of the genotype quality and association power for very low-depth sequencing designs is still lacking. RESULTS: We perform cohort-wide whole-genome sequencing (WGS) at low depth in 1239 individuals (990 at 1× depth and 249 at 4× depth) from an isolated population, and establish a robust pipeline for calling and imputing very low-depth WGS genotypes from standard bioinformatics tools. Using genotyping chip, whole-exome sequencing (75× depth) and high-depth (22×) WGS data in the same samples, we examine in detail the sensitivity of this approach, and show that imputed 1× WGS recapitulates 95.2% of variants found by imputed GWAS with an average minor allele concordance of 97% for common and low-frequency variants. In our study, 1× further allowed the discovery of 140 844 true low-frequency variants with 73% genotype concordance when compared to high-depth WGS data. Finally, using association results for 57 quantitative traits, we show that very low-depth WGS is an efficient alternative to imputed GWAS chip designs, allowing the discovery of up to twice as many true association signals as the classical imputed GWAS design. AVAILABILITY AND IMPLEMENTATION: The HELIC genotype and WGS datasets have been deposited to the European Genome-phenome Archive (https://www.ebi.ac.uk/ega/home): EGAD00010000518; EGAD00010000522; EGAD00010000610; EGAD00001001636; EGAD00001001637. The peakplotter software is available at https://github.com/wtsi-team144/peakplotter, and the transformPhenotype app can be downloaded at https://github.com/wtsi-team144/transformPhenotype. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
High-Throughput Nucleotide Sequencing , Single Nucleotide Polymorphism , Genotype , Humans , Multifactorial Inheritance , Whole Genome Sequencing
ABSTRACT
The predominantly African origin of all modern human populations is well established, but the route taken out of Africa is still unclear. Two alternative routes, via Egypt and Sinai or across the Bab el Mandeb strait into Arabia, have traditionally been proposed as feasible gateways in light of geographic, paleoclimatic, archaeological, and genetic evidence. Distinguishing among these alternatives has been difficult. We generated 225 whole-genome sequences (225 at 8× depth, of which 8 were increased to 30×; Illumina HiSeq 2000) from six modern Northeast African populations (100 Egyptians and five Ethiopian populations each represented by 25 individuals). West Eurasian components were masked out, and the remaining African haplotypes were compared with a panel of sub-Saharan African and non-African genomes. We showed that masked Northeast African haplotypes overall were more similar to non-African haplotypes and more frequently present outside Africa than were any sets of haplotypes derived from a West African population. Furthermore, the masked Egyptian haplotypes showed these properties more markedly than the masked Ethiopian haplotypes, pointing to Egypt as the more likely gateway in the exodus to the rest of the world. Using five Ethiopian and three Egyptian high-coverage masked genomes and the multiple sequentially Markovian coalescent (MSMC) approach, we estimated the genetic split times of Egyptians and Ethiopians from non-African populations at 55,000 and 65,000 years ago, respectively, whereas that of West Africans was estimated to be 75,000 years ago. Both the haplotype and MSMC analyses thus suggest a predominant northern route out of Africa via Egypt.
Subject(s)
Biological Evolution , Black People/genetics , Human Genome/genetics , Human Migration/history , Base Sequence , Ancient Egypt , Ethiopia , Geography , Haplotypes/genetics , High-Throughput Nucleotide Sequencing/methods , Ancient History , Humans , Markov Chains , Genetic Models , Molecular Sequence Data , Principal Component Analysis
ABSTRACT
MOTIVATION: Prediction of functional variant consequences is an important part of sequencing pipelines, allowing the categorization and prioritization of genetic variants for follow-up analysis. However, current predictors analyze variants as isolated events, which can lead to incorrect predictions when adjacent variants alter the same codon, or when a frame-shifting indel is followed by a frame-restoring indel. Exploiting known haplotype information when making consequence predictions can resolve these issues. RESULTS: BCFtools/csq is a fast program for haplotype-aware consequence calling which can take into account known phase. Consequence predictions are changed for 501 of 5019 compound variants found in the 81.7M variants in the 1000 Genomes Project data, with an average of 139 compound variants per haplotype. Predictions match existing tools when run in localized mode, but the program is an order of magnitude faster and requires an order of magnitude less memory. AVAILABILITY AND IMPLEMENTATION: The program is freely available for commercial and non-commercial use in the BCFtools package which is available for download from http://samtools.github.io/bcftools. CONTACT: pd3@sanger.ac.uk. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
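The compound-variant problem described above can be illustrated with a toy example. The sketch below is not BCFtools/csq code and does not reflect its internals; it simply compares the consequence of two single-nucleotide changes in one codon when each is evaluated in isolation and when both are applied together on the same haplotype. The function names and the example codon are hypothetical, and only the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def apply_variants(codon, variants):
    """Return the codon after applying (offset, alt_base) substitutions."""
    bases = list(codon)
    for offset, alt in variants:
        bases[offset] = alt
    return "".join(bases)


def classify(ref_codon, alt_codon):
    """Label the protein-level consequence of changing ref_codon to alt_codon."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop_gained"
    if ref_aa == "*":
        return "stop_lost"
    return "missense"


if __name__ == "__main__":
    ref = "CAG"                       # glutamine
    variants = [(0, "T"), (1, "G")]   # two SNVs in phase on one haplotype

    # Each variant judged on its own, as an isolated event.
    isolated = [classify(ref, apply_variants(ref, [v])) for v in variants]
    # Both variants applied together, as a haplotype-aware caller would do.
    joint = classify(ref, apply_variants(ref, variants))

    print("isolated:", isolated)
    print("haplotype-aware:", joint)
```

In this example the first variant alone appears to introduce a stop codon (CAG to TAG), but on the shared haplotype the codon becomes TGG, which encodes tryptophan, so the isolated prediction overstates the severity.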
Subject(s)
Genetic Variation , Human Genome , Haplotypes , DNA Sequence Analysis/methods , Software , Algorithms , Genomics/methods , Humans , INDEL Mutation
ABSTRACT
Runs of homozygosity (RoHs) are genomic stretches of a diploid genome that show identical alleles on both chromosomes. Longer RoHs are unlikely to have arisen by chance but are likely to denote autozygosity, whereby both copies of the genome descend from the same recent ancestor. Early tools to detect RoH used genotype array data, but substantially more information is available from sequencing data. Here, we present and evaluate BCFtools/RoH, an extension to the BCFtools software package, that detects regions of autozygosity in sequencing data, in particular exome data, using a hidden Markov model. By applying it to simulated data and real data from the 1000 Genomes Project we estimate its accuracy and show that it has higher sensitivity and specificity than existing methods under a range of sequencing error rates and levels of autozygosity. AVAILABILITY AND IMPLEMENTATION: BCFtools/RoH and its associated binary/source files are freely available from https://github.com/samtools/BCFtools. CONTACT: vn2@sanger.ac.uk or pd3@sanger.ac.uk. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
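To make the hidden Markov model idea concrete, the following is a minimal two-state Viterbi sketch. It is not the BCFtools/RoH model, which the abstract does not specify in detail: here each site is reduced to a heterozygous/homozygous flag, and the emission and switching probabilities are hypothetical values chosen for illustration only.

```python
import math

STATES = ("AZ", "HW")  # autozygous vs. non-autozygous (assumed labels)


def viterbi_roh(is_het, p_het_az=0.01, p_het_hw=0.3, p_switch=0.05):
    """Return the most likely state path for a list of per-site het flags.

    Assumed toy parameters: heterozygous calls are rare inside autozygous
    stretches (p_het_az, e.g. genotyping error) and common elsewhere
    (p_het_hw); the state changes between adjacent sites with p_switch.
    """
    if not is_het:
        return []
    log = math.log
    emit = {
        "AZ": (log(1 - p_het_az), log(p_het_az)),
        "HW": (log(1 - p_het_hw), log(p_het_hw)),
    }
    stay, switch = log(1 - p_switch), log(p_switch)

    def step(prev, cur):
        return stay if prev == cur else switch

    # Initialise with a uniform prior over the two states.
    score = {s: log(0.5) + emit[s][int(is_het[0])] for s in STATES}
    backpointers = []
    for obs in is_het[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best_prev = max(STATES, key=lambda prev: score[prev] + step(prev, cur))
            new_score[cur] = score[best_prev] + step(best_prev, cur) + emit[cur][int(obs)]
            pointers[cur] = best_prev
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    sites = [True, False, True, True] + [False] * 30 + [True, True, False, True]
    print("".join("A" if s == "AZ" else "." for s in viterbi_roh(sites)))
```

With these toy parameters a long homozygous stretch is labelled autozygous, while a single homozygous site flanked by heterozygous calls is not, because two state switches cost more than one site can recover.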
Subject(s)
High-Throughput Nucleotide Sequencing , Exome , Genomics , Genotype , Homozygote , Software
ABSTRACT
We report genome sequences of 17 inbred strains of laboratory mice and identify almost ten times more variants than previously known. We use these genomes to explore the phylogenetic history of the laboratory mouse and to examine the functional consequences of allele-specific variation on transcript abundance, revealing that at least 12% of transcripts show a significant tissue-specific expression bias. By identifying candidate functional variants at 718 quantitative trait loci we show that the molecular nature of functional variants and their position relative to genes vary according to the effect size of the locus. These sequences provide a starting point for a new era in the functional analysis of a key model organism.
Subject(s)
Gene Expression Regulation/genetics , Genetic Variation/genetics , Genome/genetics , Inbred Mouse Strains/genetics , Mice/genetics , Phenotype , Alleles , Animals , Laboratory Animals/genetics , Genomics , Mice/classification , Inbred C57BL Mice/genetics , Phylogeny , Quantitative Trait Loci/genetics
ABSTRACT
Autosomal recessive coding variants are well-known causes of rare disorders. We quantified the contribution of these variants to developmental disorders in a large, ancestrally diverse cohort comprising 29,745 trios, of whom 20.4% had genetically inferred non-European ancestries. The estimated fraction of patients attributable to exome-wide autosomal recessive coding variants ranged from ~2-19% across genetically inferred ancestry groups and was significantly correlated with average autozygosity. Established autosomal recessive developmental disorder-associated (ARDD) genes explained 84.0% of the total autosomal recessive coding burden, and 34.4% of the burden in these established genes was explained by variants not already reported as pathogenic in ClinVar. Statistical analyses identified two novel ARDD genes: KBTBD2 and ZDHHC16. This study expands our understanding of the genetic architecture of developmental disorders across diverse genetically inferred ancestry groups and suggests that improving strategies for interpreting missense variants in known ARDD genes may help diagnose more patients than discovering the remaining genes.