1.
Cell ; 186(19): 4085-4099.e15, 2023 09 14.
Article in English | MEDLINE | ID: mdl-37714134

ABSTRACT

Many sequence variants have additive effects on blood lipid levels and, through that, on the risk of coronary artery disease (CAD). We show that variants also have non-additive effects and interact to affect lipid levels as well as affecting variance and correlations. Variance and correlation effects are often signatures of epistasis or gene-environmental interactions. These complex effects can translate into CAD risk. For example, Trp154Ter in FUT2 protects against CAD among subjects with the A1 blood group, whereas it associates with greater risk of CAD in others. His48Arg in ADH1B interacts with alcohol consumption to affect lipid levels and CAD. The effect of variants in TM6SF2 on blood lipids is greatest among those who never eat oily fish but absent from those who often do. This work demonstrates that variants that affect variance of quantitative traits can allow for the discovery of epistasis and interactions of variants with the environment.


Subject(s)
Coronary Artery Disease , Animals , Humans , Coronary Artery Disease/blood , Coronary Artery Disease/genetics , Epistasis, Genetic , Phenotype , Lipids/blood , ABO Blood-Group System
2.
Nature ; 622(7982): 348-358, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37794188

ABSTRACT

High-throughput proteomics platforms measuring thousands of proteins in plasma combined with genomic and phenotypic information have the power to bridge the gap between the genome and diseases. Here we performed association studies of Olink Explore 3072 data generated by the UK Biobank Pharma Proteomics Project1 on plasma samples from more than 50,000 UK Biobank participants with phenotypic and genotypic data, stratifying on British or Irish, African and South Asian ancestries. We compared the results with those of a SomaScan v4 study on plasma from 36,000 Icelandic people2, for 1,514 of whom Olink data were also available. We found modest correlation between the two platforms. Although cis protein quantitative trait loci were detected for a similar absolute number of assays on the two platforms (2,101 on Olink versus 2,120 on SomaScan), the proportion of assays with such supporting evidence for assay performance was higher on the Olink platform (72% versus 43%). A considerable number of proteins had genomic associations that differed between the platforms. We provide examples where differences between platforms may influence conclusions drawn from the integration of protein levels with the study of diseases. We demonstrate how leveraging the diverse ancestries of participants in the UK Biobank helps to detect novel associations and refine genomic location. Our results show the value of the information provided by the two most commonly used high-throughput proteomics platforms and demonstrate the differences between them that at times provide useful complementarity.


Subject(s)
Blood Proteins , Disease Susceptibility , Genomics , Genotype , Phenotype , Proteomics , Humans , Africa/ethnology , Asia, Southern/ethnology , Biological Specimen Banks , Blood Proteins/analysis , Blood Proteins/genetics , Datasets as Topic , Genome, Human/genetics , Iceland/ethnology , Ireland/ethnology , Plasma/chemistry , Proteome/analysis , Proteome/genetics , Proteomics/methods , Quantitative Trait Loci , United Kingdom
3.
Nature ; 607(7920): 732-740, 2022 07.
Article in English | MEDLINE | ID: mdl-35859178

ABSTRACT

Detailed knowledge of how diversity in the sequence of the human genome affects phenotypic diversity depends on a comprehensive and reliable characterization of both sequences and phenotypic variation. Over the past decade, insights into this relationship have been obtained from whole-exome sequencing or whole-genome sequencing of large cohorts with rich phenotypic data1,2. Here we describe the analysis of whole-genome sequencing of 150,119 individuals from the UK Biobank3. This constitutes a set of high-quality variants, including 585,040,410 single-nucleotide polymorphisms, representing 7.0% of all possible human single-nucleotide polymorphisms, and 58,707,036 indels. This large set of variants allows us to characterize selection based on sequence variation within a population through a depletion rank score of windows along the genome. Depletion rank analysis shows that coding exons represent a small fraction of regions in the genome subject to strong sequence conservation. We define three cohorts within the UK Biobank: a large British Irish cohort, a smaller African cohort and a South Asian cohort. A haplotype reference panel is provided that allows reliable imputation of most variants carried by three or more sequenced individuals. We identified 895,055 structural variants and 2,536,688 microsatellites, groups of variants typically excluded from large-scale whole-genome sequencing studies. Using this formidable new resource, we provide several examples of trait associations for rare variants with large effects not found previously through studies based on whole-exome sequencing and/or imputation.


Subject(s)
Biological Specimen Banks , Databases, Genetic , Genetic Variation , Genome, Human , Genomics , Whole Genome Sequencing , Africa/ethnology , Asia/ethnology , Cohort Studies , Conserved Sequence , Exons/genetics , Genome, Human/genetics , Haplotypes/genetics , Humans , INDEL Mutation , Ireland/ethnology , Microsatellite Repeats , Polymorphism, Single Nucleotide/genetics , United Kingdom
4.
N Engl J Med ; 389(19): 1741-1752, 2023 Nov 09.
Article in English | MEDLINE | ID: mdl-37937776

ABSTRACT

BACKGROUND: In 2021, the American College of Medical Genetics and Genomics (ACMG) recommended reporting actionable genotypes in 73 genes associated with diseases for which preventive or therapeutic measures are available. Evaluations of the association of actionable genotypes in these genes with life span are currently lacking. METHODS: We assessed the prevalence of coding and splice variants in genes on the ACMG Secondary Findings, version 3.0 (ACMG SF v3.0), list in the genomes of 57,933 Icelanders. We assigned pathogenicity to all reviewed variants using reported evidence in the ClinVar database, the frequency of variants, and their associations with disease to create a manually curated set of actionable genotypes (variants). We assessed the relationship between these genotypes and life span and further examined the specific causes of death among carriers. RESULTS: Through manual curation of 4405 sequence variants in the ACMG SF v3.0 genes, we identified 235 actionable genotypes in 53 genes. Of the 57,933 participants, 2306 (4.0%) carried at least one actionable genotype. We found shorter median survival among persons carrying actionable genotypes than among noncarriers. Specifically, we found that carrying an actionable genotype in a cancer gene was associated with survival that was 3 years shorter than that among noncarriers, with causes of death among carriers attributed primarily to cancer-related conditions. Furthermore, we found evidence of association between carrying an actionable genotype in certain genes in the cardiovascular disease group and a reduced life span. CONCLUSIONS: On the basis of the ACMG SF v3.0 guidelines, we found that approximately 1 in 25 Icelanders carried an actionable genotype and that carrying such a genotype was associated with a reduced life span. (Funded by deCODE Genetics-Amgen.).


Subject(s)
Disease , Genomics , Longevity , Humans , Alleles , Genetic Testing , Genetic Variation , Genotype , Iceland/epidemiology , Longevity/genetics , Disease/genetics , Cardiovascular Diseases/genetics , Neoplasms/genetics
6.
Bioinformatics ; 39(8)2023 08 01.
Article in English | MEDLINE | ID: mdl-37535674

ABSTRACT

MOTIVATION: Meiotic recombination is the main driving force of human genetic diversity, along with mutations. Recombinations are divided into crossovers, which separate large chromosomal regions originating from different homologous chromosomes, and non-crossovers (NCOs), in which a small segment from one chromosome is embedded in a region originating from the homologous chromosome. NCOs are much less studied than mutations and crossovers because NCOs are short and can only be detected at markers heterozygous in the transmitting parent, leaving most of them undetectable. RESULTS: The detectable NCOs, known as gene conversions, carry hidden information about NCOs in general, including their number and length. We introduce NCOurd, a software tool and algorithm based on expectation-maximization, to estimate the number of NCOs and their length distribution from gene conversion data. AVAILABILITY AND IMPLEMENTATION: https://github.com/DecodeGenetics/NCOurd.


Subject(s)
Crossing Over, Genetic , Gene Conversion , Humans , Heterozygote , Meiosis
7.
Bioinformatics ; 38(3): 604-611, 2022 01 12.
Article in English | MEDLINE | ID: mdl-34726732

ABSTRACT

MOTIVATION: With the increasing throughput of sequencing technologies, structural variant (SV) detection has become possible across tens of thousands of genomes. Non-reference sequence (NRS) variants have drawn less attention compared with other types of SVs due to the computational complexity of detecting them. When using short-read data, the detection of NRS variants inevitably involves a de novo assembly which requires high-quality sequence data at high coverage. Previous studies have demonstrated how sequence data of multiple genomes can be combined for the reliable detection of NRS variants. However, the algorithms proposed in these studies have limited scalability to larger sets of genomes. RESULTS: We introduce PopIns2, a tool to discover and characterize NRS variants in many genomes, which scales to considerably larger numbers of genomes than its predecessor PopIns. In this article, we briefly outline the PopIns2 workflow and highlight our novel algorithmic contributions. We developed an entirely new approach for merging contig assemblies of unaligned reads from many genomes into a single set of NRS using a colored de Bruijn graph. Our tests on simulated data indicate that the new merging algorithm ranks among the best approaches in terms of quality and reliability and that PopIns2 shows the best precision for a growing number of genomes processed. Results on the Polaris Diversity Cohort and a set of 1000 Icelandic human genomes demonstrate unmatched scalability for the application on population-scale datasets. AVAILABILITY AND IMPLEMENTATION: The source code of PopIns2 is available from https://github.com/kehrlab/PopIns2. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Software , Humans , Sequence Analysis, DNA/methods , Reproducibility of Results , Genome, Human , High-Throughput Nucleotide Sequencing/methods
8.
Nature ; 549(7673): 519-522, 2017 09 28.
Article in English | MEDLINE | ID: mdl-28959963

ABSTRACT

The characterization of mutational processes that generate sequence diversity in the human genome is of paramount importance both to medical genetics and to evolutionary studies. To understand how the age and sex of transmitting parents affect de novo mutations, here we sequence 1,548 Icelanders, their parents, and, for a subset of 225, at least one child, to 35× genome-wide coverage. We find 108,778 de novo mutations, both single nucleotide polymorphisms and indels, and determine the parent of origin of 42,961. The number of de novo mutations from mothers increases by 0.37 per year of age (95% CI 0.32-0.43), a quarter of the 1.51 per year from fathers (95% CI 1.45-1.57). The number of clustered mutations increases faster with the mother's age than with the father's, and the genomic span of maternal de novo mutation clusters is greater than that of paternal ones. The types of de novo mutation from mothers change substantially with age, with a 0.26% (95% CI 0.19-0.33%) decrease in cytosine-phosphate-guanine to thymine-phosphate-guanine (CpG>TpG) de novo mutations and a 0.33% (95% CI 0.28-0.38%) increase in C>G de novo mutations per year, respectively. Remarkably, these age-related changes are not distributed uniformly across the genome. A striking example is a 20 megabase region on chromosome 8p, with a maternal C>G mutation rate that is up to 50-fold greater than the rest of the genome. The age-related accumulation of maternal non-crossover gene conversions also mostly occurs within these regions. Increased sequence diversity and linkage disequilibrium of C>G variants within regions affected by excess maternal mutations indicate that the underlying mutational process has persisted in humans for thousands of years. Moreover, the regional excess of C>G variation in humans is largely shared by chimpanzees, less by gorillas, and is almost absent from orangutans. This demonstrates that sequence diversity in humans results from evolving interactions between age, sex, mutation type, and genomic location.


Subject(s)
Aging/genetics , Germ-Line Mutation/genetics , Maternal Age , Mutagenesis , Parents , Paternal Age , Adolescent , Adult , Aged , Animals , Child , Chromosomes, Human, Pair 8/genetics , Evolution, Molecular , Female , GC Rich Sequence , Genome, Human/genetics , Gorilla gorilla/genetics , Humans , INDEL Mutation , Iceland , Linkage Disequilibrium/genetics , Male , Middle Aged , Mutation Rate , Pan troglodytes/genetics , Polymorphism, Single Nucleotide , Pongo/genetics , Young Adult
10.
Bioinformatics ; 37(15): 2215-2217, 2021 08 09.
Article in English | MEDLINE | ID: mdl-33135043

ABSTRACT

MOTIVATION: Data analysis requires reliable data. In genetics, this includes verifying that the sample is not contaminated with another, a problem ubiquitous in biology. RESULTS: In humans and other diploid species, DNA contamination from the same species can be detected by the presence of three haplotypes between polymorphic SNPs. read_haps is a tool that detects sample contamination from short-read whole-genome sequencing data. AVAILABILITY AND IMPLEMENTATION: github.com/DecodeGenetics/read_haps.
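
The three-haplotype signal described in this abstract can be illustrated with a minimal sketch. This is not the read_haps implementation: the read representation (a dict from SNP position to observed allele), the function name and the `min_support` threshold are all assumptions made for illustration. The idea is that a diploid sample can carry at most two haplotypes across any pair of SNPs, so a third well-supported allele combination hints at contamination.

```python
from collections import defaultdict
from itertools import combinations


def pairs_with_three_haplotypes(reads, min_support=2):
    """Return SNP pairs showing three or more well-supported haplotypes.

    reads: list of dicts mapping SNP position -> allele seen on that read
           (a hypothetical, simplified input format).
    min_support: reads required before a haplotype is counted, so that a
           single sequencing error does not trigger a flag.
    """
    # support[(pos_a, pos_b)][(allele_a, allele_b)] = number of reads
    support = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for pos_a, pos_b in combinations(sorted(read), 2):
            support[(pos_a, pos_b)][(read[pos_a], read[pos_b])] += 1

    flagged = []
    for pair, haplotypes in support.items():
        well_supported = [h for h, n in haplotypes.items() if n >= min_support]
        if len(well_supported) >= 3:
            flagged.append(pair)
    return sorted(flagged)


# Toy data: two SNPs at hypothetical positions 100 and 150.
clean_reads = [{100: "A", 150: "C"}] * 3 + [{100: "G", 150: "T"}] * 3
mixed_reads = clean_reads + [{100: "A", 150: "T"}] * 2

clean_flags = pairs_with_three_haplotypes(clean_reads)
mixed_flags = pairs_with_three_haplotypes(mixed_reads)
```

With the toy data, the clean sample yields no flagged pairs while the mixed sample flags the pair (100, 150). A real tool would also have to account for mapping errors, paralogous sequence and coverage, none of which this sketch attempts.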


Subject(s)
Diploidy , High-Throughput Nucleotide Sequencing , Base Sequence , Haplotypes , Humans , Sequence Analysis, DNA , Software , Whole Genome Sequencing
11.
Arterioscler Thromb Vasc Biol ; 41(10): 2616-2628, 2021 10.
Article in English | MEDLINE | ID: mdl-34407635

ABSTRACT

Objective: Familial hypercholesterolemia (FH) is traditionally defined as a monogenic disease characterized by severely elevated LDL-C (low-density lipoprotein cholesterol) levels. In practice, FH is commonly a clinical diagnosis without confirmation of a causative mutation. In this study, we sought to characterize and compare monogenic and clinically defined FH in a large sample of Icelanders. Approach and Results: We whole-genome sequenced 49 962 Icelanders and imputed the identified variants into an overall sample of 166 281 chip-genotyped Icelanders. We identified 20 FH mutations in LDLR, APOB, and PCSK9 with combined prevalence of 1 in 836. Monogenic FH was associated with severely elevated LDL-C levels and increased risk of premature coronary disease, aortic valve stenosis, and high burden of coronary atherosclerosis. We used a modified version of the Dutch Lipid Clinic Network criteria to screen for the clinical FH phenotype among living adult participants (N=79 058). Clinical FH was found in 2.2% of participants, of whom only 5.2% had monogenic FH. Mutation-negative clinical FH has a strong polygenic basis. Both individuals with monogenic FH and individuals with mutation-negative clinical FH were markedly undertreated with cholesterol-lowering medications and only a minority attained an LDL-C target of <2.6 mmol/L (<100 mg/dL; 11.0% and 24.9%, respectively) or <1.8 mmol/L (<70 mg/dL; 0.0% and 5.2%, respectively), as recommended for primary prevention by European Society of Cardiology/European Atherosclerosis Society cholesterol guidelines. Conclusions: Clinically defined FH is a relatively common phenotype that is explained by monogenic FH in only a minority of cases. Both monogenic and clinical FH confer high cardiovascular risk but are markedly undertreated.


Subject(s)
Apolipoprotein B-100/genetics , Cardiovascular Diseases/genetics , Hyperlipoproteinemia Type II/genetics , Lipids/blood , Mutation , Proprotein Convertase 9/genetics , Receptors, LDL/genetics , Adult , Aged , Aged, 80 and over , Biomarkers/blood , Cardiovascular Diseases/diagnosis , Cardiovascular Diseases/ethnology , Cardiovascular Diseases/therapy , Female , Genetic Association Studies , Genetic Predisposition to Disease , Humans , Hydroxymethylglutaryl-CoA Reductase Inhibitors/therapeutic use , Hyperlipoproteinemia Type II/diagnosis , Hyperlipoproteinemia Type II/drug therapy , Hyperlipoproteinemia Type II/ethnology , Iceland/epidemiology , Male , Middle Aged , Phenotype , Prevalence , Prognosis , Risk Assessment , Risk Factors , Young Adult
12.
Hum Mol Genet ; 28(7): 1199-1211, 2019 04 01.
Article in English | MEDLINE | ID: mdl-30476138

ABSTRACT

Urine dipstick tests are widely used in routine medical care to diagnose kidney, urinary tract and metabolic diseases. Several environmental factors are known to affect the test results, whereas the effects of genetic diversity are largely unknown. We tested 32.5 million sequence variants for association with urinary biomarkers in a set of 150 274 Icelanders with urine dipstick measurements. We detected 20 association signals, of which 14 are novel, associating with at least one of five clinical entities defined by the urine dipstick: glucosuria, ketonuria, proteinuria, hematuria and urine pH. These include three independent glucosuria variants at SLC5A2, the gene encoding the sodium-dependent glucose transporter (SGLT2), a protein targeted pharmacologically to increase urinary glucose excretion in the treatment of diabetes. Two variants associating with proteinuria are in LRP2 and CUBN, encoding the co-transporters megalin and cubilin, respectively, that mediate proximal tubule protein uptake. One of the hematuria-associated variants is a rare, previously unreported 2.5 kb exonic deletion in COL4A3. Of the four signals associated with urine pH, we note that the pH-increasing alleles of two variants (POU2AF1, WDR72) associate significantly with increased risk of kidney stones. Our results reveal that genetic factors affect variability in urinary biomarkers, in both a disease dependent and independent context.


Subject(s)
Biomarkers/analysis , Biomarkers/urine , Genetic Variation/genetics , Adult , Aged , Alleles , Female , Hematuria/genetics , Hematuria/urine , Humans , Hydrogen-Ion Concentration , Iceland , Ketosis/genetics , Ketosis/urine , Kidney/metabolism , Male , Middle Aged , Proteinuria/genetics , Proteinuria/urine , Sodium-Glucose Transporter 2/genetics , Whole Genome Sequencing/methods
13.
Bioinformatics ; 36(7): 2269-2271, 2020 04 01.
Article in English | MEDLINE | ID: mdl-31804671

ABSTRACT

SUMMARY: popSTR2 is an update and augmentation of our previous work 'popSTR: a population-based microsatellite genotyper'. To make genotyping sensitive to inter-sample differences, we supply a kernel to estimate sample-specific slippage rates. For clinical sequencing purposes, a panel of known pathogenic repeat expansions is provided along with a script that scans and flags for manual inspection markers indicative of a pathogenic expansion. Like its predecessor, popSTR2 allows for joint genotyping of samples at a population scale. We now provide a binning method that makes the microsatellite genotypes more amenable to analysis within standard association pipelines and can increase association power. AVAILABILITY AND IMPLEMENTATION: https://github.com/DecodeGenetics/popSTR. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Microsatellite Repeats , Software , Genotype
14.
BMC Med Inform Decis Mak ; 19(1): 27, 2019 02 01.
Article in English | MEDLINE | ID: mdl-30709348

ABSTRACT

BACKGROUND: Although osteoporosis is an easily diagnosed and treatable condition, many individuals remain untreated. Clinical decision support systems might increase appropriate treatment of osteoporosis. We designed the Osteoporosis Advisor (OPAD), a computerized tool to support physicians managing osteoporosis at the point-of-care. The present study compares the treatment recommendations provided by OPAD, an expert physician and the National Osteoporosis Guideline Group (NOGG). METHODS: We performed a retrospective analysis of 259 patients attending the outpatient osteoporosis clinic at the University Hospital in Iceland. We entered each patient's data into the OPAD and recorded the OPAD diagnostic comments, 10-year risk of major osteoporotic fracture and treatment options. We compared OPAD recommendations to those given by the osteoporosis specialist, and to those of the NOGG. RESULTS: Risk estimates made by OPAD were highly correlated with those from FRAX (r = 0.99, 95% CI 0.99, 1.00 without femoral neck BMD; r = 0.98, 95% CI 0.97, 0.99 with femoral neck BMD). Reassurance was recommended by the expert, NOGG and the OPAD in 68, 63 and 52% of cases, respectively. Likewise, intervention was recommended by the expert, NOGG, and the OPAD in 32, 37 and 48% of cases, respectively. The OPAD demonstrated moderate agreement with the physician (kappa 0.51, 95% CI 0.41, 0.61) and even higher agreement with NOGG (kappa 0.69, 95% CI 0.60, 0.77). CONCLUSION: Primary care physicians can use the OPAD to assess and treat patients' skeletal health. Recommendations given by OPAD are consistent with expert opinion and existing guidelines.
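
The agreement statistic reported in this abstract is Cohen's kappa, which corrects the observed agreement between two raters for the agreement expected by chance. The sketch below shows the standard calculation. The 2×2 table is made up for illustration and is not the study's data, and the function name is an assumption.

```python
def cohens_kappa(table):
    """Cohen's kappa for a square agreement table.

    table[i][j] = number of cases that rater A placed in category i
                  and rater B placed in category j.
    """
    total = sum(sum(row) for row in table)
    # Proportion of cases on which the two raters agree (the diagonal).
    observed = sum(table[i][i] for i in range(len(table))) / total
    # Agreement expected by chance, from each rater's marginal totals.
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical counts: rows = rater A (reassure, intervene),
# columns = rater B (reassure, intervene).
example_table = [[40, 10],
                 [10, 40]]
kappa = cohens_kappa(example_table)
```

In this made-up table the raters agree on 80 of 100 cases and chance agreement is 0.5, so kappa comes out at about 0.6. Confidence intervals such as those quoted in the abstract need an additional variance estimate that is not shown here.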


Subject(s)
Decision Support Systems, Clinical/standards , Osteology/methods , Osteoporosis/diagnosis , Osteoporosis/therapy , Practice Guidelines as Topic/standards , Risk Assessment/standards , Aged , Female , Humans , Middle Aged , Physicians, Primary Care , Pilot Projects , Point-of-Care Systems , Retrospective Studies
15.
J Cell Mol Med ; 22(3): 1574-1582, 2018 03.
Article in English | MEDLINE | ID: mdl-29266682

ABSTRACT

To find sequence variants affecting prostate cancer (PCA) susceptibility in an unscreened Romanian population, we performed a genome-wide association study (GWAS). The study population included 990 unrelated pathologically confirmed PCA cases and 1034 male controls. DNA was genotyped using Illumina SNP arrays, and 24,295,558 variants were imputed using the 1000 Genomes data set. An association test was performed between the imputed markers and PCA. A systematic literature review for variants associated with PCA risk identified 115 unique variants that were tested in the Romanian sample set. Thirty of the previously reported SNPs replicated (P-value < 0.05), with the strongest associations observed at: 8q24.21, 11q13.3, 6q25.3, 5p15.33, 22q13.2, 17q12 and 3q13.2. The replicated variants showing the most significant association in Romania are rs1016343 at 8q24.21 (P = 2.2 × 10⁻⁴), rs7929962 at 11q13.3 (P = 2.7 × 10⁻⁴) and rs9364554 at 6q25.2 (P = 4.7 × 10⁻⁴). None of the variants tested in the Romanian GWAS reached genome-wide significance (P-value < 5 × 10⁻⁸) but 807 markers had P-values < 1 × 10⁻⁴. Here, we report the results of the first GWAS of PCA performed in a Romanian population. Our study provides evidence that a substantial fraction of previously validated PCA variants associate with risk in this unscreened Romanian population.


Subject(s)
Biomarkers, Tumor/genetics , Genetic Loci , Genetic Predisposition to Disease , Polymorphism, Single Nucleotide , Prostate-Specific Antigen/genetics , Prostatic Neoplasms/diagnosis , Aged , Aged, 80 and over , Alleles , Biomarkers, Tumor/blood , Case-Control Studies , Gene Expression Profiling , Gene Frequency , Genome, Human , Genome-Wide Association Study , Humans , Male , Middle Aged , Neoplasm Staging , Oligonucleotide Array Sequence Analysis , Prostate-Specific Antigen/blood , Prostatic Neoplasms/blood , Prostatic Neoplasms/genetics , Prostatic Neoplasms/pathology , Risk , Romania
16.
J Cell Mol Med ; 22(12): 6068-6076, 2018 12.
Article in English | MEDLINE | ID: mdl-30324682

ABSTRACT

Two familial forms of colorectal cancer (CRC), Lynch syndrome (LS) and familial adenomatous polyposis (FAP), are caused by rare mutations in DNA mismatch repair genes (MLH1, MSH2, MSH6, PMS2) and the genes APC and MUTYH, respectively. No information is available on the presence of high-risk CRC mutations in the Romanian population. We performed whole-genome sequencing of 61 Romanian CRC cases with a family history of cancer and/or early onset of disease, focusing the analysis on candidate variants in the LS and FAP genes. The frequencies of all candidate variants were assessed in a cohort of 688 CRC cases and 4567 controls. Immunohistochemical (IHC) staining for MLH1, MSH2, MSH6, and PMS2 was performed on tumour tissue. We identified 11 candidate variants in 11 cases; six variants in MLH1, one in MSH6, one in PMS2, and three in APC. Combining information on the predicted impact of the variants on the proteins, IHC results and previous reports, we found three novel pathogenic variants (MLH1:p.Lys84ThrfsTer4, MLH1:p.Ala586CysfsTer7, PMS2:p.Arg211ThrfsTer38), and two novel variants that are unlikely to be pathogenic. We also confirmed three previously published pathogenic LS variants and suggest reclassifying a previously reported variant of uncertain significance as pathogenic (MLH1:c.1559-1G>C).


Subject(s)
Adenomatous Polyposis Coli/genetics , Colorectal Neoplasms, Hereditary Nonpolyposis/genetics , DNA Mismatch Repair/genetics , Genetic Predisposition to Disease , Adenomatous Polyposis Coli/epidemiology , Adenomatous Polyposis Coli/pathology , Adult , Aged , Colorectal Neoplasms, Hereditary Nonpolyposis/epidemiology , Colorectal Neoplasms, Hereditary Nonpolyposis/pathology , DNA Glycosylases/genetics , DNA Methylation/genetics , DNA-Binding Proteins/genetics , Female , Humans , Male , Middle Aged , Mismatch Repair Endonuclease PMS2/genetics , MutL Protein Homolog 1/genetics , MutS Homolog 2 Protein/genetics , Mutation , Risk Factors , Romania/epidemiology
17.
Hum Mol Genet ; 25(5): 1008-18, 2016 Mar 01.
Article in English | MEDLINE | ID: mdl-26740556

ABSTRACT

Transcriptional and splicing anomalies have been observed in intron 8 of the CASP8 gene (encoding procaspase-8) in association with cutaneous basal-cell carcinoma (BCC) and linked to a germline SNP rs700635. Here, we show that the rs700635[C] allele, which is associated with increased risk of BCC and breast cancer, is protective against prostate cancer [odds ratio (OR) = 0.91, P = 1.0 × 10⁻⁶]. rs700635[C] is also associated with failures to correctly splice out CASP8 intron 8 in breast and prostate tumours and in corresponding normal tissues. Investigation of rs700635[C] carriers revealed that they have a human-specific short interspersed element-variable number of tandem repeat-Alu (SINE-VNTR-Alu), subfamily-E retrotransposon (SVA-E) inserted into CASP8 intron 8. The SVA-E shows evidence of prior activity, because it has transduced some CASP8 sequences during subsequent retrotransposition events. Whole-genome sequence (WGS) data were used to tag the SVA-E with a surrogate SNP rs1035142[T] (r² = 0.999), which showed associations with both the splicing anomalies (P = 6.5 × 10⁻³²) and with protection against prostate cancer (OR = 0.91, P = 3.8 × 10⁻⁷).


Subject(s)
Breast Neoplasms/genetics , Carcinoma, Basal Cell/genetics , Caspase 8/genetics , Prostatic Neoplasms/genetics , RNA Splicing , Retroelements , Skin Neoplasms/genetics , Adult , Aged , Aged, 80 and over , Alleles , Base Sequence , Breast Neoplasms/metabolism , Breast Neoplasms/pathology , Carcinoma, Basal Cell/metabolism , Carcinoma, Basal Cell/pathology , Caspase 8/metabolism , Female , Genome-Wide Association Study , Humans , Introns , Male , Middle Aged , Molecular Sequence Data , Odds Ratio , Polymorphism, Single Nucleotide , Prostatic Neoplasms/metabolism , Prostatic Neoplasms/pathology , Prostatic Neoplasms/prevention & control , Protective Factors , Skin Neoplasms/metabolism , Skin Neoplasms/pathology
18.
Bioinformatics ; 33(24): 4041-4048, 2017 Dec 15.
Article in English | MEDLINE | ID: mdl-27591079

ABSTRACT

MOTIVATION: Microsatellites, also known as short tandem repeats (STRs), are tracts of repetitive DNA sequences containing motifs ranging from two to six bases. Microsatellites are one of the most abundant types of variation in the human genome, after single nucleotide polymorphisms (SNPs) and indels. Microsatellite analysis has a wide range of applications, including medical genetics, forensics and construction of genetic genealogy. However, microsatellite variations are rarely considered in whole-genome sequencing studies, in large part due to a lack of tools capable of analyzing them. RESULTS: Here we present a microsatellite genotyper, optimized for Illumina WGS data, which is both faster and more accurate than other methods previously presented. There are two main ingredients to our improvements. First, we reduce the amount of sequencing data necessary for creating microsatellite profiles by using previously aligned sequencing data. Second, we use population information to train microsatellite and individual specific error profiles. By comparing our genotyping results to genotypes generated by capillary electrophoresis we show that our error rates are 50% lower than those of lobSTR, another program specifically developed to determine microsatellite genotypes. AVAILABILITY AND IMPLEMENTATION: Source code is available on GitHub: https://github.com/DecodeGenetics/popSTR. CONTACT: snaedis.kristmundsdottir@decode.is or bjarni.halldorsson@decode.is.


Subject(s)
Microsatellite Repeats , Genotype , Humans , Software , Whole Genome Sequencing
19.
Bioinformatics ; 32(7): 961-7, 2016 04 01.
Article in English | MEDLINE | ID: mdl-25926346

ABSTRACT

MOTIVATION: The detection of genomic structural variation (SV) has advanced tremendously in recent years due to progress in high-throughput sequencing technologies. Novel sequence insertions, insertions without similarity to a human reference genome, have received less attention than other types of SVs due to the computational challenges in their detection from short read sequencing data, which inherently involves de novo assembly. De novo assembly is not only computationally challenging, but also requires high-quality data. Although the reads from a single individual may not always meet this requirement, using reads from multiple individuals can increase power to detect novel insertions. RESULTS: We have developed the program PopIns, which can discover and characterize non-reference insertions of 100 bp or longer on a population scale. In this article, we describe the approach we implemented in PopIns. It takes as input a reads-to-reference alignment, assembles unaligned reads using a standard assembly tool, merges the contigs of different individuals into high-confidence sequences, anchors the merged sequences into the reference genome, and finally genotypes all individuals for the discovered insertions. Our tests on simulated data indicate that the merging step greatly improves the quality and reliability of predicted insertions and that PopIns shows significantly better recall and precision than the recent tool MindTheGap. Preliminary results on a dataset of 305 Icelanders demonstrate the practicality of the new approach. AVAILABILITY AND IMPLEMENTATION: The source code of PopIns is available from http://github.com/bkehr/popins. CONTACT: birte.kehr@decode.is. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Computational Biology/methods , High-Throughput Nucleotide Sequencing , Sequence Analysis, DNA , Genomic Structural Variation , Humans , Mutagenesis, Insertional , Reproducibility of Results
20.
Bioinformatics ; 30(24): 3541-7, 2014 Dec 15.
Article in English | MEDLINE | ID: mdl-25355787

ABSTRACT

MOTIVATION: Several applications in bioinformatics, such as genome assemblers and error correction methods, rely on counting and keeping track of k-mers (substrings of length k). Histograms of k-mer frequencies can give valuable insight into the underlying distribution and indicate the error rate and genome size sampled in the sequencing experiment. RESULTS: We present KmerStream, a streaming algorithm for estimating the number of distinct k-mers present in high-throughput sequencing data. The algorithm runs in time linear in the size of the input and the space requirement is logarithmic in the size of the input. We derive a simple model that allows us to estimate the error rate of the sequencing experiment, as well as the genome size, using only the aggregate statistics reported by KmerStream. As an application we show how KmerStream can be used to compute the error rate of a DNA sequencing experiment. We run KmerStream on a set of 2656 whole genome sequenced individuals and compare the error rate to quality values reported by the sequencing equipment. We discover that while the quality values alone are largely reliable as a predictor of error rate, there is considerable variability in the error rates between sequencing runs, even when accounting for reported quality values.
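
For readers unfamiliar with the quantities this abstract mentions, the sketch below computes them exactly on toy data: the number of distinct k-mers and the histogram of k-mer frequencies. This is a naive exact baseline, not the KmerStream streaming estimator, and it stores every k-mer, so its memory use grows with the input rather than logarithmically. The function names and reads are made up for illustration.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer (substring of length k) across all reads, exactly."""
    counts = Counter()
    for read in reads:
        # A read of length n contains n - k + 1 overlapping k-mers.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def frequency_histogram(counts):
    """Map each multiplicity to the number of distinct k-mers seen that often."""
    return Counter(counts.values())


# Two short, overlapping toy reads.
toy_reads = ["ACGTAC", "CGTACG"]
counts = kmer_counts(toy_reads, k=3)
distinct_kmers = len(counts)
histogram = frequency_histogram(counts)
```

Here both reads contain the same four 3-mers (ACG, CGT, GTA, TAC), so there are four distinct k-mers, each seen twice. In real sequencing data, k-mers seen only once are often dominated by sequencing errors, which is the kind of signal a frequency histogram exposes.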


Subject(s)
Algorithms , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Genome Size , Genome, Human , Genomics/methods , Humans , Software