Results 1 - 20 of 28
1.
Bioinformatics ; 40(Suppl 2): ii11-ii19, 2024 09 01.
Article in English | MEDLINE | ID: mdl-39230689

ABSTRACT

MOTIVATION: Complex structural variants (SVs) are genomic rearrangements that involve multiple segments of DNA. They contribute to human diversity and have been shown to cause Mendelian disease. Nevertheless, our abilities to analyse complex SVs are very limited. As opposed to deletions and other canonical types of SVs, there are no established tools that have explicitly been designed for analysing complex SVs. RESULTS: Here, we describe a new computational approach that we specifically designed for genotyping complex SVs in short-read sequenced genomes. Given a variant description, our approach computes genotype-specific probability distributions for observing aligned read pairs with a wide range of properties. Subsequently, these distributions can be used to efficiently determine the most likely genotype for any set of aligned read pairs observed in a sequenced genome. In addition, we use these distributions to compute a genotyping difficulty for a given variant, which predicts the amount of data needed to achieve a reliable call. Careful evaluation confirms that our approach outperforms other genotypers by making reliable genotype predictions across both simulated and real data. On up to 7829 human genomes, we achieve high concordance with population-genetic assumptions and expected inheritance patterns. On simulated data, we show that precision correlates well with our prediction of genotyping difficulty. This together with low memory and time requirements makes our approach well-suited for application in biomedical studies involving small to very large numbers of short-read sequenced genomes. AVAILABILITY AND IMPLEMENTATION: Source code is available at https://github.com/kehrlab/Complex-SV-Genotyping.


Subject(s)
Genome, Human , Genomic Structural Variation , Sequence Analysis, DNA , Software , Humans , Sequence Analysis, DNA/methods , Genotype , Genotyping Techniques/methods , Algorithms , High-Throughput Nucleotide Sequencing/methods , Genomics/methods
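
The genotyping idea in the abstract of record 1 (genotype-specific probability distributions over read-pair properties, then picking the most likely genotype) can be illustrated with a minimal sketch. This is not the authors' implementation: the three read-pair classes, the probabilities, and the function names are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical per-genotype probabilities of observing each read-pair class.
# Class names and numbers are illustrative assumptions, not values from the paper.
DISTRIBUTIONS = {
    "0/0": {"reference": 0.97, "split": 0.01, "discordant": 0.02},
    "0/1": {"reference": 0.60, "split": 0.15, "discordant": 0.25},
    "1/1": {"reference": 0.10, "split": 0.35, "discordant": 0.55},
}


def log_likelihoods(observed_counts, distributions=DISTRIBUTIONS):
    """Return the log-likelihood of the observed read-pair counts per genotype."""
    return {
        genotype: sum(
            count * math.log(probs[read_class])
            for read_class, count in observed_counts.items()
        )
        for genotype, probs in distributions.items()
    }


def most_likely_genotype(observed_counts, distributions=DISTRIBUTIONS):
    """Pick the genotype whose distribution best explains the observations."""
    scores = log_likelihoods(observed_counts, distributions)
    return max(scores, key=scores.get)
```

Under these assumed distributions, a sample with only reference-like read pairs is called `0/0`, while a sample dominated by split and discordant pairs is called `1/1`.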
2.
Bioinformatics ; 38(3): 604-611, 2022 01 12.
Article in English | MEDLINE | ID: mdl-34726732

ABSTRACT

MOTIVATION: With the increasing throughput of sequencing technologies, structural variant (SV) detection has become possible across tens of thousands of genomes. Non-reference sequence (NRS) variants have drawn less attention compared with other types of SVs due to the computational complexity of detecting them. When using short-read data, the detection of NRS variants inevitably involves a de novo assembly which requires high-quality sequence data at high coverage. Previous studies have demonstrated how sequence data of multiple genomes can be combined for the reliable detection of NRS variants. However, the algorithms proposed in these studies have limited scalability to larger sets of genomes. RESULTS: We introduce PopIns2, a tool to discover and characterize NRS variants in many genomes, which scales to considerably larger numbers of genomes than its predecessor PopIns. In this article, we briefly outline the PopIns2 workflow and highlight our novel algorithmic contributions. We developed an entirely new approach for merging contig assemblies of unaligned reads from many genomes into a single set of NRS using a colored de Bruijn graph. Our tests on simulated data indicate that the new merging algorithm ranks among the best approaches in terms of quality and reliability and that PopIns2 shows the best precision for a growing number of genomes processed. Results on the Polaris Diversity Cohort and a set of 1000 Icelandic human genomes demonstrate unmatched scalability for the application on population-scale datasets. AVAILABILITY AND IMPLEMENTATION: The source code of PopIns2 is available from https://github.com/kehrlab/PopIns2. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Software , Humans , Sequence Analysis, DNA/methods , Reproducibility of Results , Genome, Human , High-Throughput Nucleotide Sequencing/methods
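
The abstract of record 2 mentions merging contigs from many genomes with a colored de Bruijn graph. A minimal sketch of that data structure follows, assuming each sample is one "color" and ignoring reverse complements; the function names are hypothetical and this is not the PopIns2 code.

```python
from collections import defaultdict


def colored_de_bruijn(contigs_by_sample, k):
    """Map each k-mer to the set of samples (colors) whose contigs contain it."""
    colors = defaultdict(set)
    for sample, contigs in contigs_by_sample.items():
        for contig in contigs:
            for i in range(len(contig) - k + 1):
                colors[contig[i:i + k]].add(sample)
    return dict(colors)


def successors(kmer, colors):
    """Return the k-mers in the graph that extend kmer's (k-1)-suffix by one base."""
    suffix = kmer[1:]
    return [suffix + base for base in "ACGT" if suffix + base in colors]
```

A k-mer shared by several samples carries several colors, which is what lets a merging step tell apart sequence supported by many genomes from sequence seen in only one.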
3.
Nature ; 549(7673): 519-522, 2017 09 28.
Article in English | MEDLINE | ID: mdl-28959963

ABSTRACT

The characterization of mutational processes that generate sequence diversity in the human genome is of paramount importance both to medical genetics and to evolutionary studies. To understand how the age and sex of transmitting parents affect de novo mutations, here we sequence 1,548 Icelanders, their parents, and, for a subset of 225, at least one child, to 35× genome-wide coverage. We find 108,778 de novo mutations, both single nucleotide polymorphisms and indels, and determine the parent of origin of 42,961. The number of de novo mutations from mothers increases by 0.37 per year of age (95% CI 0.32-0.43), a quarter of the 1.51 per year from fathers (95% CI 1.45-1.57). The number of clustered mutations increases faster with the mother's age than with the father's, and the genomic span of maternal de novo mutation clusters is greater than that of paternal ones. The types of de novo mutation from mothers change substantially with age, with a 0.26% (95% CI 0.19-0.33%) decrease in cytosine-phosphate-guanine to thymine-phosphate-guanine (CpG>TpG) de novo mutations and a 0.33% (95% CI 0.28-0.38%) increase in C>G de novo mutations per year, respectively. Remarkably, these age-related changes are not distributed uniformly across the genome. A striking example is a 20 megabase region on chromosome 8p, with a maternal C>G mutation rate that is up to 50-fold greater than the rest of the genome. The age-related accumulation of maternal non-crossover gene conversions also mostly occurs within these regions. Increased sequence diversity and linkage disequilibrium of C>G variants within regions affected by excess maternal mutations indicate that the underlying mutational process has persisted in humans for thousands of years. Moreover, the regional excess of C>G variation in humans is largely shared by chimpanzees, less by gorillas, and is almost absent from orangutans. This demonstrates that sequence diversity in humans results from evolving interactions between age, sex, mutation type, and genomic location.


Subject(s)
Aging/genetics , Germ-Line Mutation/genetics , Maternal Age , Mutagenesis , Parents , Paternal Age , Adolescent , Adult , Aged , Animals , Child , Chromosomes, Human, Pair 8/genetics , Evolution, Molecular , Female , GC Rich Sequence , Genome, Human/genetics , Gorilla gorilla/genetics , Humans , INDEL Mutation , Iceland , Linkage Disequilibrium/genetics , Male , Middle Aged , Mutation Rate , Pan troglodytes/genetics , Polymorphism, Single Nucleotide , Pongo/genetics , Young Adult
4.
Bioinformatics ; 37(19): 3128-3135, 2021 Oct 11.
Article in English | MEDLINE | ID: mdl-33830196

ABSTRACT

MOTIVATION: Genome Architecture Mapping (GAM) was recently introduced as a digestion- and ligation-free method to detect chromatin conformation. Orthogonal to existing approaches based on chromatin conformation capture (3C), GAM's ability to capture both inter- and intra-chromosomal contacts from low amounts of input data makes it particularly well suited for allele-specific analyses in a clinical setting. Allele-specific analyses are powerful tools to investigate the effects of genetic variants on many cellular phenotypes including chromatin conformation, but require the haplotypes of the individuals under study to be known a priori. So far, however, no algorithm exists for haplotype reconstruction and phasing of genetic variants from GAM data, hindering the allele-specific analysis of chromatin contact points in non-model organisms or individuals with unknown haplotypes. RESULTS: We present GAMIBHEAR, a tool for accurate haplotype reconstruction from GAM data. GAMIBHEAR aggregates allelic co-observation frequencies from GAM data and employs a GAM-specific probabilistic model of haplotype capture to optimize phasing accuracy. Using a hybrid mouse embryonic stem cell line with known haplotype structure as a benchmark dataset, we assess correctness and completeness of the reconstructed haplotypes, and demonstrate the power of GAMIBHEAR to infer accurate genome-wide haplotypes from GAM data. AVAILABILITY AND IMPLEMENTATION: GAMIBHEAR is available as an R package under the open-source GPL-2 license at https://bitbucket.org/schwarzlab/gamibhear. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

5.
Hum Mol Genet ; 28(7): 1199-1211, 2019 04 01.
Article in English | MEDLINE | ID: mdl-30476138

ABSTRACT

Urine dipstick tests are widely used in routine medical care to diagnose kidney and urinary tract and metabolic diseases. Several environmental factors are known to affect the test results, whereas the effects of genetic diversity are largely unknown. We tested 32.5 million sequence variants for association with urinary biomarkers in a set of 150 274 Icelanders with urine dipstick measurements. We detected 20 association signals, of which 14 are novel, associating with at least one of five clinical entities defined by the urine dipstick: glucosuria, ketonuria, proteinuria, hematuria and urine pH. These include three independent glucosuria variants at SLC5A2, the gene encoding the sodium-dependent glucose transporter (SGLT2), a protein targeted pharmacologically to increase urinary glucose excretion in the treatment of diabetes. Two variants associating with proteinuria are in LRP2 and CUBN, encoding the co-transporters megalin and cubilin, respectively, that mediate proximal tubule protein uptake. One of the hematuria-associated variants is a rare, previously unreported 2.5 kb exonic deletion in COL4A3. Of the four signals associated with urine pH, we note that the pH-increasing alleles of two variants (POU2AF1, WDR72) associate significantly with increased risk of kidney stones. Our results reveal that genetic factors affect variability in urinary biomarkers, in both a disease dependent and independent context.


Subject(s)
Biomarkers/analysis , Biomarkers/urine , Genetic Variation/genetics , Adult , Aged , Alleles , Female , Hematuria/genetics , Hematuria/urine , Humans , Hydrogen-Ion Concentration , Iceland , Ketosis/genetics , Ketosis/urine , Kidney/metabolism , Male , Middle Aged , Proteinuria/genetics , Proteinuria/urine , Sodium-Glucose Transporter 2/genetics , Whole Genome Sequencing/methods
6.
Hum Mol Genet ; 26(12): 2364-2376, 2017 06 15.
Article in English | MEDLINE | ID: mdl-28398513

ABSTRACT

Common sequence variants at the haptoglobin gene (HP) have been associated with blood lipid levels. Through whole-genome sequencing of 8,453 Icelanders, we discovered a splice donor founder mutation in HP (NM_001126102.1:c.190+1G>C, minor allele frequency = 0.56%). This mutation occurs on the HP1 allele of the common copy number variant in HP and leads to a loss of function of HP1. It associates with lower levels of haptoglobin (P = 2.1 × 10^-54), higher levels of non-high density lipoprotein cholesterol (β = 0.26 mmol/l, P = 2.6 × 10^-9) and greater risk of coronary artery disease (odds ratio = 1.30, 95% confidence interval: 1.10-1.54, P = 0.0024). Through haplotype analysis and with RNA sequencing, we provide evidence of a causal relationship between one of the two haptoglobin isoforms, namely Hp1, and lower levels of non-HDL cholesterol. Furthermore, we show that the HP1 allele associates with various other quantitative biological traits.


Subject(s)
Coronary Artery Disease/genetics , Haptoglobins/genetics , Adult , Alleles , Base Sequence , Coronary Artery Disease/metabolism , DNA Copy Number Variations/genetics , Female , Gene Frequency/genetics , Genetic Association Studies/methods , Genetic Variation , Haptoglobins/metabolism , Humans , Iceland , Lipids/blood , Lipids/genetics , Lipoproteins/genetics , Male , Mutation , Odds Ratio , RNA Splice Sites/genetics , Risk Factors
7.
Hum Mol Genet ; 25(5): 1008-18, 2016 Mar 01.
Article in English | MEDLINE | ID: mdl-26740556

ABSTRACT

Transcriptional and splicing anomalies have been observed in intron 8 of the CASP8 gene (encoding procaspase-8) in association with cutaneous basal-cell carcinoma (BCC) and linked to a germline SNP rs700635. Here, we show that the rs700635[C] allele, which is associated with increased risk of BCC and breast cancer, is protective against prostate cancer [odds ratio (OR) = 0.91, P = 1.0 × 10^-6]. rs700635[C] is also associated with failures to correctly splice out CASP8 intron 8 in breast and prostate tumours and in corresponding normal tissues. Investigation of rs700635[C] carriers revealed that they have a human-specific short interspersed element-variable number of tandem repeat-Alu (SINE-VNTR-Alu), subfamily-E retrotransposon (SVA-E) inserted into CASP8 intron 8. The SVA-E shows evidence of prior activity, because it has transduced some CASP8 sequences during subsequent retrotransposition events. Whole-genome sequence (WGS) data were used to tag the SVA-E with a surrogate SNP rs1035142[T] (r^2 = 0.999), which showed associations with both the splicing anomalies (P = 6.5 × 10^-32) and with protection against prostate cancer (OR = 0.91, P = 3.8 × 10^-7).


Subject(s)
Breast Neoplasms/genetics , Carcinoma, Basal Cell/genetics , Caspase 8/genetics , Prostatic Neoplasms/genetics , RNA Splicing , Retroelements , Skin Neoplasms/genetics , Adult , Aged , Aged, 80 and over , Alleles , Base Sequence , Breast Neoplasms/metabolism , Breast Neoplasms/pathology , Carcinoma, Basal Cell/metabolism , Carcinoma, Basal Cell/pathology , Caspase 8/metabolism , Female , Genome-Wide Association Study , Humans , Introns , Male , Middle Aged , Molecular Sequence Data , Odds Ratio , Polymorphism, Single Nucleotide , Prostatic Neoplasms/metabolism , Prostatic Neoplasms/pathology , Prostatic Neoplasms/prevention & control , Protective Factors , Skin Neoplasms/metabolism , Skin Neoplasms/pathology
8.
Bioinformatics ; 33(24): 4041-4048, 2017 Dec 15.
Article in English | MEDLINE | ID: mdl-27591079

ABSTRACT

MOTIVATION: Microsatellites, also known as short tandem repeats (STRs), are tracts of repetitive DNA sequences containing motifs ranging from two to six bases. Microsatellites are one of the most abundant types of variation in the human genome, after single nucleotide polymorphisms (SNPs) and indels. Microsatellite analysis has a wide range of applications, including medical genetics, forensics and construction of genetic genealogy. However, microsatellite variations are rarely considered in whole-genome sequencing studies, in large part due to a lack of tools capable of analyzing them. RESULTS: Here we present a microsatellite genotyper, optimized for Illumina WGS data, which is both faster and more accurate than other methods previously presented. There are two main ingredients to our improvements. First, we reduce the amount of sequencing data necessary for creating microsatellite profiles by using previously aligned sequencing data. Second, we use population information to train microsatellite and individual specific error profiles. By comparing our genotyping results to genotypes generated by capillary electrophoresis we show that our error rates are 50% lower than those of lobSTR, another program specifically developed to determine microsatellite genotypes. AVAILABILITY AND IMPLEMENTATION: Source code is available on GitHub: https://github.com/DecodeGenetics/popSTR. CONTACT: snaedis.kristmundsdottir@decode.is or bjarni.halldorsson@decode.is.


Subject(s)
Microsatellite Repeats , Genotype , Humans , Software , Whole Genome Sequencing
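
The definition of microsatellites in the abstract of record 8 (tandem tracts with motifs of two to six bases) can be made concrete with a small regular-expression scan. This is only an illustration of the definition, not the popSTR method; the minimum of three copies and the function name are assumptions made for the example.

```python
import re

# A motif of 2-6 bases (shortest motif preferred) followed by at least two
# further copies. The three-copy minimum is an assumption for illustration.
STR_PATTERN = re.compile(r"([ACGT]{2,6}?)\1{2,}")


def find_microsatellites(sequence):
    """Yield (start, end, motif, copies) for each tandem repeat tract found."""
    for match in STR_PATTERN.finditer(sequence):
        motif = match.group(1)
        copies = (match.end() - match.start()) // len(motif)
        yield match.start(), match.end(), motif, copies
```

Note that this naive pattern also reports homopolymer runs as a two-base motif such as `AA`; a real genotyper would need to handle that case, along with sequencing errors, explicitly.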
9.
Bioinformatics ; 32(14): 2202-4, 2016 07 15.
Article in English | MEDLINE | ID: mdl-27153590

ABSTRACT

Advances in sequencing capacity have led to the generation of unprecedented amounts of genomic data. The processing of this data frequently leads to I/O bottlenecks, e.g. when analyzing a small genomic region across a large number of samples. The largest I/O burden is, however, often not imposed by the amount of data needed for the analysis but rather by index files that help retrieve this data. We have developed chopBAI, a program that can chop a BAM index (BAI) file into small pieces. The program outputs a list of BAI files each indexing a specified genomic interval. The output files are much smaller in size but maintain compatibility with existing software tools. We show how preprocessing BAI files with chopBAI can lead to a reduction of I/O by more than 95% during the analysis of 10 kb genomic regions, eventually enabling the joint analysis of more than 10 000 individuals. AVAILABILITY AND IMPLEMENTATION: The software is implemented in C++, GPL licensed and available at http://github.com/DecodeGenetics/chopBAI. CONTACT: birte.kehr@decode.is.


Subject(s)
Computational Biology/methods , Genomics/methods , Software , Humans
10.
Bioinformatics ; 32(7): 961-7, 2016 04 01.
Article in English | MEDLINE | ID: mdl-25926346

ABSTRACT

MOTIVATION: The detection of genomic structural variation (SV) has advanced tremendously in recent years due to progress in high-throughput sequencing technologies. Novel sequence insertions, insertions without similarity to a human reference genome, have received less attention than other types of SVs due to the computational challenges in their detection from short read sequencing data, which inherently involves de novo assembly. De novo assembly is not only computationally challenging, but also requires high-quality data. Although the reads from a single individual may not always meet this requirement, using reads from multiple individuals can increase power to detect novel insertions. RESULTS: We have developed the program PopIns, which can discover and characterize non-reference insertions of 100 bp or longer on a population scale. In this article, we describe the approach we implemented in PopIns. It takes as input a reads-to-reference alignment, assembles unaligned reads using a standard assembly tool, merges the contigs of different individuals into high-confidence sequences, anchors the merged sequences into the reference genome, and finally genotypes all individuals for the discovered insertions. Our tests on simulated data indicate that the merging step greatly improves the quality and reliability of predicted insertions and that PopIns shows significantly better recall and precision than the recent tool MindTheGap. Preliminary results on a dataset of 305 Icelanders demonstrate the practicality of the new approach. AVAILABILITY AND IMPLEMENTATION: The source code of PopIns is available from http://github.com/bkehr/popins. CONTACT: birte.kehr@decode.is. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Computational Biology/methods , High-Throughput Nucleotide Sequencing , Sequence Analysis, DNA , Genomic Structural Variation , Humans , Mutagenesis, Insertional , Reproducibility of Results
11.
Bioinformatics ; 30(4): 540-8, 2014 Feb 15.
Article in English | MEDLINE | ID: mdl-24336806

ABSTRACT

MOTIVATION: Owing to recent advancements in high-throughput technologies, protein-protein interaction networks of more and more species are becoming available in public databases. The question of how to identify functionally conserved proteins across species attracts a lot of attention in computational biology. Network alignments provide a systematic way to solve this problem. However, most existing alignment tools encounter limitations in tackling this problem. Therefore, the demand for faster and more efficient alignment tools is growing. RESULTS: We present a fast and accurate algorithm, NetCoffee, which finds a global alignment of multiple protein-protein interaction networks. NetCoffee searches for a global alignment by maximizing a target function using simulated annealing on a set of weighted bipartite graphs that are constructed using a triplet approach similar to T-Coffee. To assess its performance, NetCoffee was applied to four real datasets. Our results suggest that NetCoffee remedies several limitations of previous algorithms, outperforms all existing alignment tools in terms of speed and nevertheless identifies biologically meaningful alignments. AVAILABILITY: The source code and data are freely available for download under the GNU GPL v3 license at https://code.google.com/p/netcoffee/.


Subject(s)
Algorithms , Computational Biology/methods , Gene Regulatory Networks , Protein Interaction Mapping/methods , Proteins/metabolism , Sequence Alignment/methods , Animals , Bacteria , Databases, Protein , Humans , Models, Biological , Software
12.
BMC Bioinformatics ; 15: 99, 2014 Apr 09.
Article in English | MEDLINE | ID: mdl-24712884

ABSTRACT

BACKGROUND: Recent advances in rapid, low-cost sequencing have opened up the opportunity to study complete genome sequences. The computational approach of multiple genome alignment allows investigation of evolutionarily related genomes in an integrated fashion, providing a basis for downstream analyses such as rearrangement studies and phylogenetic inference. Graphs have proven to be a powerful tool for coping with the complexity of genome-scale sequence alignments. The potential of graphs to intuitively represent all aspects of genome alignments led to the development of graph-based approaches for genome alignment. These approaches construct a graph from a set of local alignments, and derive a genome alignment through identification and removal of graph substructures that indicate errors in the alignment. RESULTS: We compare the structures of commonly used graphs in terms of their abilities to represent alignment information. We describe how the graphs can be transformed into each other, and identify and classify graph substructures common to one or more graphs. Based on previous approaches, we compile a list of modifications that remove these substructures. CONCLUSION: We show that crucial pieces of alignment information, associated with inversions and duplications, are not visible in the structure of all graphs. If we neglect vertex or edge labels, the graphs differ in their information content. Still, many ideas are shared among all graph-based approaches. Based on these findings, we outline a conceptual framework for graph-based genome alignment that can assist in the development of future genome alignment tools.


Subject(s)
Genomics/methods , Sequence Alignment/methods , Algorithms , Computer Graphics , Genome
13.
BMC Bioinformatics ; 12 Suppl 9: S15, 2011 Oct 05.
Article in English | MEDLINE | ID: mdl-22151882

ABSTRACT

BACKGROUND: Large-scale comparison of genomic sequences requires reliable tools for the search of local alignments. Practical local aligners are in general fast, but heuristic, and hence sometimes miss significant matches. RESULTS: We present here the local pairwise aligner STELLAR that has full sensitivity for ε-alignments, i.e. guarantees to report all local alignments of a given minimal length and maximal error rate. The aligner is composed of two steps, filtering and verification. We apply the SWIFT algorithm for lossless filtering, and have developed a new verification strategy that we prove to be exact. Our results on simulated and real genomic data confirm and quantify the conjecture that heuristic tools like BLAST or BLAT miss a large percentage of significant local alignments. CONCLUSIONS: STELLAR is very practical and fast on very long sequences which makes it a suitable new tool for finding local alignments between genomic sequences under the edit distance model. Binaries are freely available for Linux, Windows, and Mac OS X at http://www.seqan.de/projects/stellar. The source code is freely distributed with the SeqAn C++ library version 1.3 and later at http://www.seqan.de.


Subject(s)
Genomics/methods , Sequence Alignment/methods , Software , Algorithms , Animals , Drosophila/genetics
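
The abstract of record 13 describes ε-alignments as local alignments of a given minimal length and maximal error rate. The sketch below checks that criterion for a single alignment given as two gapped rows. It assumes the error rate is the number of mismatch and gap columns divided by the number of alignment columns; the function name and parameters are hypothetical, and this is not STELLAR's filtering or verification code.

```python
def is_epsilon_match(row_a, row_b, min_length, max_error_rate):
    """Check one gapped alignment against a length and error-rate threshold.

    row_a and row_b are equally long strings that use '-' for gaps.
    Assumed definition: error rate = (mismatch + gap columns) / all columns.
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    columns = len(row_a)
    if columns == 0 or columns < min_length:
        return False
    errors = sum(1 for a, b in zip(row_a, row_b) if a != b)
    return errors / columns <= max_error_rate
```

For example, an alignment of ten columns with one mismatch has an error rate of 0.1, so it passes with a maximal error rate of 0.1 but fails with 0.05.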
14.
Med Genet ; 33(2): 133-145, 2021 Jun.
Article in English | MEDLINE | ID: mdl-38836034

ABSTRACT

High-throughput sequencing techniques have significantly increased the molecular diagnosis rate for patients with monogenic disorders. This is primarily due to a substantially increased identification rate of disease mutations in the coding sequence, primarily SNVs and indels. Further progress is hampered by difficulties in the detection of structural variants and the interpretation of variants outside the coding sequence. In this review, we provide an overview of how novel sequencing techniques and state-of-the-art algorithms can be used to discover small and structural variants across the whole genome and introduce bioinformatic tools for predicting the effects that variants may have in the non-coding part of the genome.

15.
Nat Commun ; 12(1): 730, 2021 02 01.
Article in English | MEDLINE | ID: mdl-33526789

ABSTRACT

Thousands of genomic structural variants (SVs) segregate in the human population and can impact phenotypic traits and diseases. Their identification in whole-genome sequence data of large cohorts is a major computational challenge. Most current approaches identify SVs in single genomes and afterwards merge the identified variants into a joint call set across many genomes. We describe the approach PopDel, which directly identifies deletions of about 500 to at least 10,000 bp in length in data of many genomes jointly, eliminating the need for subsequent variant merging. PopDel scales to tens of thousands of genomes as we demonstrate in evaluations on up to 49,962 genomes. We show that PopDel reliably reports common, rare and de novo deletions. On genomes with available high-confidence reference call sets PopDel shows excellent recall and precision. Genotype inheritance patterns in up to 6794 trios indicate that genotypes predicted by PopDel are more reliable than those of previous SV callers. Furthermore, PopDel's running time is competitive with the fastest tested previous tools. The demonstrated scalability and accuracy of PopDel enables routine scans for deletions in large-scale sequencing studies.


Subject(s)
Genome, Human/genetics , Genomic Structural Variation , Metagenomics/methods , Sequence Deletion , Feasibility Studies , Female , High-Throughput Nucleotide Sequencing , Humans , Inheritance Patterns , Male , Reproducibility of Results , Sequence Analysis, DNA
16.
Datenbank Spektrum ; 21(3): 255-260, 2021.
Article in English | MEDLINE | ID: mdl-34786019

ABSTRACT

Today's scientific data analysis very often requires complex Data Analysis Workflows (DAWs) executed over distributed computational infrastructures, e.g., clusters. Much research effort is devoted to the tuning and performance optimization of specific workflows for specific clusters. However, an arguably even more important problem for accelerating research is the reduction of development, adaptation, and maintenance times of DAWs. We describe the design and setup of the Collaborative Research Center (CRC) 1404 "FONDA -- Foundations of Workflows for Large-Scale Scientific Data Analysis", in which roughly 50 researchers jointly investigate new technologies, algorithms, and models to increase the portability, adaptability, and dependability of DAWs executed over distributed infrastructures. We describe the motivation behind our project, explain its underlying core concepts, introduce FONDA's internal structure, and sketch our vision for the future of workflow-based scientific data analysis. We also describe some lessons learned during the "making of" a CRC in Computer Science with strong interdisciplinary components, with the aim to foster similar endeavors.

17.
Circ Genom Precis Med ; 14(1): e003029, 2021 02.
Article in English | MEDLINE | ID: mdl-33315477

ABSTRACT

BACKGROUND: Loss-of-function mutations in the LDL (low-density lipoprotein) receptor gene (LDLR) cause elevated levels of LDL cholesterol and premature cardiovascular disease. To date, a gain-of-function mutation in LDLR with a large effect on LDL cholesterol levels has not been described. Here, we searched for sequence variants in LDLR that have a large effect on LDL cholesterol levels. METHODS: We analyzed whole-genome sequencing data from 43 202 Icelanders. Single-nucleotide polymorphisms and structural variants including deletions, insertions, and duplications were genotyped using whole-genome sequencing-based data. LDL cholesterol associations were carried out in a sample of >100 000 Icelanders with genetic information (imputed or whole-genome sequencing). Molecular analyses were performed using RNA sequencing and protein expression assays in Epstein-Barr virus-transformed lymphocytes. RESULTS: We discovered a 2.5-kb deletion (del2.5) overlapping the 3' untranslated region of LDLR in 7 heterozygous carriers from a single family. Mean level of LDL cholesterol was 74% lower in del2.5 carriers than in 101 851 noncarriers, a difference of 2.48 mmol/L (96 mg/dL; P=8.4×10^-8). Del2.5 results in production of an alternative mRNA isoform with a truncated 3' untranslated region. The truncation leads to a loss of target sites for microRNAs known to repress translation of LDLR. In Epstein-Barr virus-transformed lymphocytes derived from del2.5 carriers, expression of the alternative mRNA isoform was 1.84-fold higher than the wild-type isoform (P=0.0013), and there was 1.79-fold higher surface expression of the LDL receptor than in noncarriers (P=0.0086). We did not find a highly penetrant detrimental impact of lifelong very low levels of LDL cholesterol due to del2.5 on health of the carriers. CONCLUSIONS: Del2.5 is the first reported gain-of-function mutation in LDLR causing a large reduction in LDL cholesterol. These data point to a role for alternative polyadenylation of LDLR mRNA as a potent regulator of LDL receptor expression in humans.


Subject(s)
Cholesterol, LDL/blood , Receptors, LDL/genetics , 3' Untranslated Regions , Alternative Splicing , Gain of Function Mutation , Gene Deletion , Genetic Vectors/genetics , Genetic Vectors/metabolism , Herpesvirus 4, Human/genetics , Heterozygote , Humans , Hyperlipoproteinemia Type II/genetics , Hyperlipoproteinemia Type II/pathology , Iceland , Lymphocytes/cytology , Lymphocytes/metabolism , MicroRNAs/metabolism , Pedigree , Protein Isoforms/genetics , RNA, Messenger/metabolism
18.
Nat Genet ; 50(11): 1616, 2018 11.
Article in English | MEDLINE | ID: mdl-30237445

ABSTRACT

In the version of this article published, statements about the impact of insertions and deletions on gene conversions were incorrect. We reported a bias toward deletions, whereas in fact the bias was toward insertions. We are deeply indebted to Laurent Duret and Brice Letcher for noticing this mistake in our manuscript. The following statements are incorrect in the published manuscript.

19.
Nat Genet ; 50(12): 1674-1680, 2018 12.
Article in English | MEDLINE | ID: mdl-30397338

ABSTRACT

De novo mutations (DNMs) cause a large proportion of severe rare diseases of childhood. DNMs that occur early may result in mosaicism of both somatic and germ cells. Such early mutations can cause recurrence of disease. We scanned 1,007 sibling pairs from 251 families and identified 878 DNMs shared by siblings (ssDNMs) at 448 genomic sites. We estimated DNM recurrence probability based on parental mosaicism, sharing of DNMs among siblings, parent-of-origin, mutation type and genomic position. We detected 57.2% of ssDNMs in the parental blood. The recurrence probability of a DNM decreases by 2.27% per year for paternal DNMs and 1.78% per year for maternal DNMs. Maternal ssDNMs are more likely to be T>C mutations than paternal ssDNMs, and less likely to be C>T mutations. Depending on the properties of the DNM, the recurrence probability ranges from 0.011% to 28.5%. We have launched an online calculator to allow estimation of DNM recurrence probability for research purposes.


Subject(s)
Family , Inheritance Patterns , Mutation , Parent-Child Relations , Adult , Child , Embryonic Germ Cells/metabolism , Family Characteristics , Female , Germ-Line Mutation , Humans , Inheritance Patterns/genetics , Male , Mosaicism , Pedigree
20.
Nat Genet ; 49(11): 1654-1660, 2017 Nov.
Article in English | MEDLINE | ID: mdl-28945251

ABSTRACT

A fundamental requirement for genetic studies is an accurate determination of sequence variation. While human genome sequence diversity is increasingly well characterized, there is a need for efficient ways to use this knowledge in sequence analysis. Here we present Graphtyper, a publicly available novel algorithm and software for discovering and genotyping sequence variants. Graphtyper realigns short-read sequence data to a pangenome, a variation-aware graph structure that encodes sequence variation within a population by representing possible haplotypes as graph paths. Our results show that Graphtyper is fast, highly scalable, and provides sensitive and accurate genotype calls. Graphtyper genotyped 89.4 million sequence variants in the whole genomes of 28,075 Icelanders using less than 100,000 CPU days, including detailed genotyping of six human leukocyte antigen (HLA) genes. We show that Graphtyper is a valuable tool in characterizing sequence variation in both small and population-scale sequencing studies.


Subject(s)
Algorithms , Genome, Human , Genotyping Techniques/instrumentation , Polymorphism, Single Nucleotide , Sequence Analysis, DNA/statistics & numerical data , Alleles , Base Sequence , Computer Graphics , HLA Antigens/genetics , Haplotypes , High-Throughput Nucleotide Sequencing , Humans , Sequence Alignment , Sequence Analysis, DNA/methods , Software
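
The abstract of record 20 describes a variation-aware graph in which possible haplotypes are graph paths. A heavily simplified sketch of that idea follows, assuming the graph is a linear chain of sites, each with a list of alternative allele strings. The representation and function names are hypothetical, and exhaustive path enumeration is used only because the example is tiny; this is not how Graphtyper works internally.

```python
from itertools import product


def haplotype_paths(sites):
    """Enumerate every haplotype sequence spelled by one allele choice per site.

    sites is a list of lists of allele strings; invariant stretches are
    sites with a single allele. Only feasible for very small examples.
    """
    return ["".join(choice) for choice in product(*sites)]


def paths_supporting_read(read, sites):
    """Return the haplotype sequences that contain the read as a substring."""
    return [path for path in haplotype_paths(sites) if read in path]
```

In the toy example used in the checks, a read spanning the C allele of a T/C site matches only the paths that go through that allele, which is the kind of evidence a graph-based genotyper can count per allele.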