1.
Nat Comput Sci ; 4(5): 360-366, 2024 May.
Article in English | MEDLINE | ID: mdl-38745108

ABSTRACT

For many genome-wide association studies, imputing genotypes from a haplotype reference panel is a necessary step. Over the past 15 years, reference panels have become larger and more diverse, leading to improvements in imputation accuracy. However, the latest generation of reference panels is subject to restrictions on data sharing due to concerns about privacy, limiting their usefulness for genotype imputation. In this context, here we propose RESHAPE, a method that employs a recombination Poisson process on a reference panel to simulate the genomes of hypothetical descendants after multiple generations. This data transformation helps to protect against re-identification threats and preserves data attributes, such as linkage disequilibrium patterns and, to some degree, identity-by-descent sharing, allowing for genotype imputation. Our experiments on gold-standard datasets show that simulated descendants up to eight generations can serve as reference panels without substantially reducing genotype imputation accuracy.


Subjects
Genome-Wide Association Study; Genotype; Humans; Genome-Wide Association Study/methods; Linkage Disequilibrium; Haplotypes/genetics; Polymorphism, Single Nucleotide/genetics; Information Dissemination/methods; Computer Simulation; Models, Genetic; Algorithms; Genome, Human/genetics; Poisson Distribution
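
The recombination step described in the RESHAPE abstract above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the rate of one crossover per Morgan, the random pairing of haplotypes, and the choice to emit two complementary mosaics per pair are all assumptions made for the example.

```python
import random


def sample_crossovers(length_morgans, rng):
    """Sample crossover positions (in Morgans) from a Poisson process.

    Assumption: rate of 1 crossover per Morgan, so gaps are Exp(1).
    """
    positions = []
    pos = rng.expovariate(1.0)
    while pos < length_morgans:
        positions.append(pos)
        pos += rng.expovariate(1.0)
    return positions


def recombine_pair(hap_a, hap_b, genetic_pos, rng):
    """Return two complementary mosaics of hap_a and hap_b.

    genetic_pos[i] is the genetic position (Morgans) of site i, ascending.
    """
    crossovers = sample_crossovers(genetic_pos[-1], rng)
    child_1, child_2 = [], []
    swapped = False
    next_idx = 0
    for site, gpos in enumerate(genetic_pos):
        # Flip the copying source once for every crossover passed so far.
        while next_idx < len(crossovers) and crossovers[next_idx] <= gpos:
            swapped = not swapped
            next_idx += 1
        first, second = (hap_b, hap_a) if swapped else (hap_a, hap_b)
        child_1.append(first[site])
        child_2.append(second[site])
    return child_1, child_2


def simulate_descendants(panel, genetic_pos, generations, seed=0):
    """Apply `generations` rounds of random pairing and recombination."""
    rng = random.Random(seed)
    current = [list(h) for h in panel]
    for _ in range(generations):
        order = list(range(len(current)))
        rng.shuffle(order)
        nxt = []
        for i in range(0, len(order) - 1, 2):
            nxt.extend(recombine_pair(current[order[i]],
                                      current[order[i + 1]],
                                      genetic_pos, rng))
        if len(order) % 2 == 1:
            # Hypothetical choice: an unpaired haplotype is carried over.
            nxt.append(current[order[-1]])
        current = nxt
    return current
```

Because each pair produces two complementary mosaics, this sketch keeps the panel size and the per-site allele counts unchanged while breaking up the original long-range haplotypes; whether RESHAPE itself enforces that property is not stated in the abstract.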
2.
Sci Rep ; 14(1): 6227, 2024 03 14.
Article in English | MEDLINE | ID: mdl-38486065

ABSTRACT

Low-coverage imputation is becoming ever more present in ancient DNA (aDNA) studies. Imputation pipelines commonly used for present-day genomes have been shown to yield accurate results when applied to ancient genomes. However, post-mortem damage (PMD), in the form of C-to-T substitutions at the reads termini, and contamination with DNA from closely related species can potentially affect imputation performance in aDNA. In this study, we evaluated imputation performance (i) when using a genotype caller designed for aDNA, ATLAS, compared to bcftools, and (ii) when contamination is present. We evaluated imputation performance with principal component analyses and by calculating imputation error rates. With a particular focus on differently imputed sites, we found that using ATLAS prior to imputation substantially improved imputed genotypes for a very damaged ancient genome (42% PMD). Trimming the ends of the sequencing reads led to similar improvements in imputation accuracy. For the remaining genomes, ATLAS brought limited gains. Finally, to examine the effect of contamination on imputation, we added various amounts of reads from two present-day genomes to a previously downsampled high-coverage ancient genome. We observed that imputation accuracy drastically decreased for contamination rates above 5%. In conclusion, we recommend (i) accounting for PMD by either trimming sequencing reads or using a genotype caller such as ATLAS before imputing highly damaged genomes and (ii) only imputing genomes containing up to 5% of contamination.


Subjects
DNA, Ancient; Genome; Genotype; Genome-Wide Association Study/methods; Polymorphism, Single Nucleotide
3.
Bioinformatics ; 39(9)2023 09 02.
Article in English | MEDLINE | ID: mdl-37688560

ABSTRACT

MOTIVATION: The Positional Burrows-Wheeler Transform (PBWT) is a data structure that indexes haplotype sequences in a manner that enables finding maximal haplotype matches in h sequences containing w variation sites in O(hw) time. This represents a significant improvement over classical quadratic-time approaches. However, the original PBWT data structure does not allow for queries over Biobank panels that consist of several millions of haplotypes, if an index of the haplotypes must be kept entirely in memory. RESULTS: In this article, we leverage the notion of r-index proposed for the BWT to present a memory-efficient method for constructing and storing the run-length encoded PBWT, and computing set maximal matches (SMEMs) queries in haplotype sequences. We implement our method, which we refer to as µ-PBWT, and evaluate it on datasets of 1000 Genomes Project and UK Biobank data. Our experiments demonstrate that the µ-PBWT reduces the memory usage up to a factor of 20% compared to the best current PBWT-based indexing. In particular, µ-PBWT produces an index that stores high-coverage whole genome sequencing data of chromosome 20 in about a third of the space of its BCF file. µ-PBWT is an adaptation of techniques for the run-length compressed BWT for the PBWT (RLPBWT) and it is based on keeping in memory only a succinct representation of the RLPBWT that still allows the efficient computation of set maximal matches (SMEMs) over the original panel. AVAILABILITY AND IMPLEMENTATION: Our implementation is open source and available at https://github.com/dlcgold/muPBWT. The binary is available at https://bioconda.github.io/recipes/mupbwt/README.html.


Subjects
Biological Specimen Banks; Haplotypes; Whole Genome Sequencing; United Kingdom
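
To make the O(hw) claim in the abstract above concrete, here is a small sketch of the uncompressed PBWT construction (positional prefix array plus divergence array) for biallelic haplotypes coded as 0/1. It illustrates the classical single-pass construction only and is not the run-length encoded µ-PBWT; the function name `pbwt_arrays` is an assumption made for this example.

```python
def pbwt_arrays(haplotypes):
    """Build the final positional prefix and divergence arrays.

    haplotypes: list of h equal-length lists of 0/1 alleles (w sites).
    Returns (ppa, div), where ppa orders haplotypes by reversed prefix and
    div[i] is the first site of the longest match, ending at the last site,
    between ppa[i] and ppa[i - 1]. Runs in O(h * w) time.
    """
    h = len(haplotypes)
    w = len(haplotypes[0]) if h else 0
    ppa = list(range(h))
    div = [0] * h
    for k in range(w):
        zeros, ones = [], []          # haplotype indices by allele at site k
        div_zeros, div_ones = [], []  # matching divergence values
        p = q = k + 1                 # running match starts for each bucket
        for idx, start in zip(ppa, div):
            p = max(p, start)
            q = max(q, start)
            if haplotypes[idx][k] == 0:
                zeros.append(idx)
                div_zeros.append(p)
                p = 0
            else:
                ones.append(idx)
                div_ones.append(q)
                q = 0
        # Stable partition: all 0-carriers first, then all 1-carriers.
        ppa = zeros + ones
        div = div_zeros + div_ones
    return ppa, div
```

Adjacent entries in `ppa` share the longest suffix matches, which is why long haplotype matches can be reported during the same linear scan instead of comparing every pair of haplotypes.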
4.
Nat Genet ; 55(7): 1243-1249, 2023 07.
Article in English | MEDLINE | ID: mdl-37386248

ABSTRACT

Phasing involves distinguishing the two parentally inherited copies of each chromosome into haplotypes. Here, we introduce SHAPEIT5, a new phasing method that quickly and accurately processes large sequencing datasets, and apply it to UK Biobank (UKB) whole-genome and whole-exome sequencing data. We demonstrate that SHAPEIT5 phases rare variants with low switch error rates of below 5% for variants present in just 1 sample out of 100,000. Furthermore, we outline a method for phasing singletons, which, although less precise, constitutes an important step towards future developments. We then demonstrate that the use of UKB as a reference panel improves the accuracy of genotype imputation, which is even more pronounced when phased with SHAPEIT5 compared with other methods. Finally, we screen the UKB data for loss-of-function compound heterozygous events and identify 549 genes where both gene copies are knocked out. These genes complement current knowledge of gene essentiality in the human genome.


Subjects
Biological Specimen Banks; Genome, Human; Humans; Exome Sequencing; Sequence Analysis, DNA/methods; Genotype; Haplotypes; Genome, Human/genetics; United Kingdom; Polymorphism, Single Nucleotide/genetics
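
The switch error rate quoted in the SHAPEIT5 abstract above is commonly computed by comparing the inferred phase with a trusted phase at consecutive heterozygous sites. The sketch below shows one simple version of that calculation; the function name and the input layout (one `(allele_hap1, allele_hap2)` pair per site) are assumptions for illustration, and the paper's exact evaluation protocol is not given in the abstract.

```python
def switch_error_rate(truth, inferred):
    """Fraction of consecutive heterozygous site pairs whose phase flips.

    truth, inferred: sequences of (allele_hap1, allele_hap2) per site for
    one individual, with identical unphased genotypes at every site.
    """
    orientations = []
    for t, i in zip(truth, inferred):
        t, i = tuple(t), tuple(i)
        if t[0] == t[1]:
            continue  # homozygous sites carry no phase information
        if i == t:
            orientations.append(0)
        elif i == (t[1], t[0]):
            orientations.append(1)
        else:
            raise ValueError("genotypes differ between truth and inferred")
    if len(orientations) < 2:
        return 0.0
    switches = sum(1 for prev, cur in zip(orientations, orientations[1:])
                   if prev != cur)
    return switches / (len(orientations) - 1)
```

A single switch flips the orientation of every downstream heterozygous site, so the metric counts changes in orientation rather than the number of sites that look mismatched.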
5.
Nat Genet ; 55(7): 1088-1090, 2023 07.
Article in English | MEDLINE | ID: mdl-37386250

ABSTRACT

The release of 150,119 UK Biobank sequences represents an unprecedented opportunity as a reference panel to impute low-coverage whole-genome sequencing data with high accuracy but current methods cannot cope with the size of the data. Here we introduce GLIMPSE2, a low-coverage whole-genome sequencing imputation method that scales sublinearly in both the number of samples and markers, achieving efficient whole-genome imputation from the UK Biobank reference panel while retaining high accuracy for ancient and modern genomes, particularly at rare variants and for very low-coverage samples.


Subjects
Biological Specimen Banks; Polymorphism, Single Nucleotide; Gene Frequency; Polymorphism, Single Nucleotide/genetics; Genome; United Kingdom; Genotype
6.
Nat Commun ; 14(1): 3660, 2023 06 20.
Article in English | MEDLINE | ID: mdl-37339987

ABSTRACT

Due to postmortem DNA degradation and microbial colonization, most ancient genomes have low depth of coverage, hindering genotype calling. Genotype imputation can improve genotyping accuracy for low-coverage genomes. However, it is unknown how accurate ancient DNA imputation is and whether imputation introduces bias to downstream analyses. Here we re-sequence an ancient trio (mother, father, son) and downsample and impute a total of 43 ancient genomes, including 42 high-coverage (above 10x) genomes. We assess imputation accuracy across ancestries, time, depth of coverage, and sequencing technology. We find that ancient and modern DNA imputation accuracies are comparable. When downsampled at 1x, 36 of the 42 genomes are imputed with low error rates (below 5%) while African genomes have higher error rates. We validate imputation and phasing results using the ancient trio data and an orthogonal approach based on Mendel's rules of inheritance. We further compare the downstream analysis results between imputed and high-coverage genomes, notably principal component analysis, genetic clustering, and runs of homozygosity, observing similar results starting from 0.5x coverage, except for the African genomes. These results suggest that, for most populations and depths of coverage as low as 0.5x, imputation is a reliable method that can improve ancient DNA studies.


Subjects
Genome, Human; Genotyping Techniques; Humans; Genotyping Techniques/methods; Genome, Human/genetics; DNA, Ancient; Genotype; Genome-Wide Association Study/methods; Polymorphism, Single Nucleotide
7.
Bioinformatics ; 39(2)2023 02 03.
Article in English | MEDLINE | ID: mdl-36637197

ABSTRACT

SUMMARY: We introduce mapache, a flexible, robust and scalable pipeline to map, quantify and impute ancient and present-day DNA in a reproducible way. Mapache is implemented in the workflow manager Snakemake and is optimized for low-space consumption, allowing large datasets, such as reference panels and multiple extracts and libraries per sample, to be efficiently (re)mapped to one or several genomes. Mapache can easily be customized or combined with other Snakemake tools. AVAILABILITY AND IMPLEMENTATION: Mapache is freely available on GitHub (https://github.com/sneuensc/mapache). An extensive manual is provided at https://github.com/sneuensc/mapache/wiki. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
DNA, Ancient; Software; Genome; Workflow
8.
Nat Commun ; 13(1): 6668, 2022 11 05.
Article in English | MEDLINE | ID: mdl-36335127

ABSTRACT

Identical genetic variations can have different phenotypic effects depending on their parent of origin. Yet, studies focusing on parent-of-origin effects have been limited in terms of sample size due to the lack of parental genomes or known genealogies. We propose a probabilistic approach to infer the parent-of-origin of individual alleles that does not require parental genomes nor prior knowledge of genealogy. Our model uses Identity-By-Descent sharing with second- and third-degree relatives to assign alleles to parental groups and leverages chromosome X data in males to distinguish maternal from paternal groups. We combine this with robust haplotype inference and haploid imputation to infer the parent-of-origin for 26,393 UK Biobank individuals. We screen 99 phenotypes for parent-of-origin effects and replicate the discoveries of 6 GWAS studies, confirming signals on body mass index, type 2 diabetes, standing height and multiple blood biomarkers, including the known maternal effect at the MEG3/DLK1 locus on platelet phenotypes. We also report a novel maternal effect at the TERT gene on telomere length, thereby providing new insights on the heritability of this phenotype. All our summary statistics are publicly available to help the community to better characterize the molecular mechanisms leading to parent-of-origin effects and their implications for human health.


Subjects
Diabetes Mellitus, Type 2; Humans; Male; Alleles; Biological Specimen Banks; Genome-Wide Association Study; Phenotype; Female
9.
Nat Commun ; 13(1): 5107, 2022 08 30.
Article in English | MEDLINE | ID: mdl-36042219

ABSTRACT

The SARS-CoV-2 pandemic has differentially impacted populations across race and ethnicity. A multi-omic approach represents a powerful tool to examine risk across multi-ancestry genomes. We leverage a pandemic tracking strategy in which we sequence viral and host genomes and transcriptomes from nasopharyngeal swabs of 1049 individuals (736 SARS-CoV-2 positive and 313 SARS-CoV-2 negative) and integrate them with digital phenotypes from electronic health records from a diverse catchment area in Northern California. Genome-wide association disaggregated by admixture mapping reveals novel COVID-19-severity-associated regions containing previously reported markers of neurologic, pulmonary and viral disease susceptibility. Phylodynamic tracking of consensus viral genomes reveals no association with disease severity or inferred ancestry. Summary data from multiomic investigation reveals metagenomic and HLA associations with severe COVID-19. The wealth of data available from residual nasopharyngeal swabs in combination with clinical data abstracted automatically at scale highlights a powerful strategy for pandemic tracking, and reveals distinct epidemiologic, genetic, and biological associations for those at the highest risk.


Subjects
COVID-19; Pandemics; COVID-19/epidemiology; Genome, Viral; Genome-Wide Association Study; Humans; SARS-CoV-2/genetics
10.
Bioinformatics ; 38(15): 3778-3784, 2022 08 02.
Article in English | MEDLINE | ID: mdl-35748697

ABSTRACT

MOTIVATION: Generation of genotype data has been growing exponentially over the last decade. With the large size of recent datasets comes a storage and computational burden with ever increasing costs. To reduce this burden, we propose XSI, a file format with reduced storage footprint that also allows computation on the compressed data and we show how this can improve future analyses. RESULTS: We show that xSqueezeIt (XSI) allows for a file size reduction of 4-20× compared with compressed BCF and demonstrate its potential for 'compressive genomics' on the UK Biobank whole-genome sequencing genotypes with 8× faster loading times, 5× faster run of homozygosity computation, 30× faster dot products computation and 280× faster allele counts. AVAILABILITY AND IMPLEMENTATION: The XSI file format specifications, API and command line tool are released under open-source (MIT) license and are available at https://github.com/rwk-unil/xSqueezeIt. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Data Compression; Software; Biological Specimen Banks; Genomics; Genotype
11.
Nat Commun ; 12(1): 4842, 2021 08 10.
Article in English | MEDLINE | ID: mdl-34376650

ABSTRACT

Nearby genes are often expressed as a group. Yet, the prevalence, molecular mechanisms and genetic control of local gene co-expression are far from being understood. Here, by leveraging gene expression measurements across 49 human tissues and hundreds of individuals, we find that local gene co-expression occurs in 13% to 53% of genes per tissue. By integrating various molecular assays (e.g. ChIP-seq and Hi-C), we estimate the ability of several mechanisms, such as enhancer-gene interactions, in distinguishing gene pairs that are co-expressed from those that are not. Notably, we identify 32,636 expression quantitative trait loci (eQTLs) which associate with co-expressed gene pairs and often overlap enhancer regions. Due to affecting several genes, these eQTLs are more often associated with multiple human traits than other eQTLs. Our study paves the way to comprehend trait pleiotropy and functional interpretation of QTL and GWAS findings. All local gene co-expression identified here is available through a public database ( https://glcoex.unil.ch/ ).


Subjects
Gene Expression Regulation; Genetic Pleiotropy/genetics; Genome, Human/genetics; Genome-Wide Association Study/methods; Polymorphism, Single Nucleotide; Quantitative Trait Loci/genetics; Binding Sites/genetics; Gene Ontology; Genetic Association Studies/methods; Humans; Regulatory Sequences, Nucleic Acid/genetics; Transcription Factors/metabolism
13.
Nat Genet ; 53(1): 120-126, 2021 01.
Article in English | MEDLINE | ID: mdl-33414550

ABSTRACT

Low-coverage whole-genome sequencing followed by imputation has been proposed as a cost-effective genotyping approach for disease and population genetics studies. However, its competitiveness against SNP arrays is undermined because current imputation methods are computationally expensive and unable to leverage large reference panels. Here, we describe a method, GLIMPSE, for phasing and imputation of low-coverage sequencing datasets from modern reference panels. We demonstrate its remarkable performance across different coverages and human populations. GLIMPSE achieves imputation of a genome for less than US$1 in computational cost, considerably outperforming other methods and improving imputation accuracy over the full allele frequency range. As a proof of concept, we show that 1× coverage enables effective gene expression association studies and outperforms dense SNP arrays in rare variant burden tests. Overall, this study illustrates the promising potential of low-coverage imputation and suggests a paradigm shift in the design of future genomic studies.


Subjects
Sequence Analysis, DNA; Genome, Human; Genotype; Humans; Likelihood Functions; Polymorphism, Single Nucleotide/genetics; Reference Standards
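
Low-coverage imputation methods such as the one in the GLIMPSE abstract above start from uncertain genotypes rather than hard calls. The sketch below shows why, using a textbook binomial read model that is an assumption of this example and not GLIMPSE's own model; the function names and the default 1% sequencing error rate are hypothetical.

```python
from math import comb


def genotype_likelihoods(ref_reads, alt_reads, error_rate=0.01):
    """P(reads | genotype) for genotypes with 0, 1 and 2 alternate alleles.

    Assumes independent reads and a symmetric per-read error rate.
    """
    n = ref_reads + alt_reads
    # Probability that a single read shows the alternate allele.
    p_alt = (error_rate, 0.5, 1.0 - error_rate)
    return [comb(n, alt_reads) * p ** alt_reads * (1.0 - p) ** ref_reads
            for p in p_alt]


def genotype_posteriors(ref_reads, alt_reads, error_rate=0.01):
    """Normalize the likelihoods under a flat prior over the 3 genotypes."""
    likes = genotype_likelihoods(ref_reads, alt_reads, error_rate)
    total = sum(likes)
    return [x / total for x in likes]
```

With a single reference read, for example, the heterozygous genotype keeps roughly a third of the posterior mass under these assumptions, so a 1× genome cannot be called site by site and needs information borrowed from a reference panel.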
14.
PLoS Genet ; 16(11): e1009049, 2020 11.
Article in English | MEDLINE | ID: mdl-33196638

ABSTRACT

Genotype imputation is the process of predicting unobserved genotypes in a sample of individuals using a reference panel of haplotypes. In the last 10 years reference panels have increased in size by more than 100 fold. Increasing reference panel size improves accuracy of markers with low minor allele frequencies but poses ever increasing computational challenges for imputation methods. Here we present IMPUTE5, a genotype imputation method that can scale to reference panels with millions of samples. This method continues to refine the observation made in the IMPUTE2 method, that accuracy is optimized via use of a custom subset of haplotypes when imputing each individual. It achieves fast, accurate, and memory-efficient imputation by selecting haplotypes using the Positional Burrows Wheeler Transform (PBWT). By using the PBWT data structure at genotyped markers, IMPUTE5 identifies locally best matching haplotypes and long identical by state segments. The method then uses the selected haplotypes as conditioning states within the IMPUTE model. Using the HRC reference panel, which has ∼65,000 haplotypes, we show that IMPUTE5 is up to 30x faster than MINIMAC4 and up to 3x faster than BEAGLE5.1, and uses less memory than both these methods. Using simulated reference panels we show that IMPUTE5 scales sub-linearly with reference panel size. For example, keeping the number of imputed markers constant, increasing the reference panel size from 10,000 to 1 million haplotypes requires less than twice the computation time. As the reference panel increases in size IMPUTE5 is able to utilize a smaller number of reference haplotypes, thus reducing computational cost.


Subjects
Computational Biology/methods; Genome-Wide Association Study/methods; Haplotypes/genetics; Alleles; Forecasting/methods; Gene Frequency/genetics; Genotype; Humans; Models, Theoretical; Polymorphism, Single Nucleotide/genetics
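
The idea in the IMPUTE5 abstract above of conditioning on a custom subset of locally best-matching haplotypes can be illustrated with a deliberately naive scan. IMPUTE5 uses the PBWT to find these matches efficiently; the code below swaps in a brute-force comparison purely for clarity, and both function names are hypothetical.

```python
def match_length_around(target, ref, k):
    """Length of the identical-by-state run between target and ref that
    contains site k (0 if the two haplotypes differ at site k)."""
    if target[k] != ref[k]:
        return 0
    left = k
    while left > 0 and target[left - 1] == ref[left - 1]:
        left -= 1
    right = k
    while right < len(target) - 1 and target[right + 1] == ref[right + 1]:
        right += 1
    return right - left + 1


def select_conditioning_states(target, panel, k, n_states):
    """Indices of the n_states reference haplotypes with the longest
    identical-by-state run around site k (naive scan over the panel)."""
    ranked = sorted(range(len(panel)),
                    key=lambda j: match_length_around(target, panel[j], k),
                    reverse=True)
    return ranked[:n_states]
```

As a panel grows, the best local matches tend to get longer, so fewer conditioning haplotypes are needed per target, which is consistent with the sub-linear scaling the abstract reports.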
15.
medRxiv ; 2020 Sep 01.
Article in English | MEDLINE | ID: mdl-32766602

ABSTRACT

During COVID19 and other viral pandemics, rapid generation of host and pathogen genomic data is critical to tracking infection and informing therapies. There is an urgent need for efficient approaches to this data generation at scale. We have developed a scalable, high throughput approach to generate high fidelity low pass whole genome and HLA sequencing, viral genomes, and representation of human transcriptome from single nasopharyngeal swabs of COVID19 patients.

16.
Cancer Inform ; 14(Suppl 4): 53-65, 2015.
Article in English | MEDLINE | ID: mdl-26380549

ABSTRACT

We introduce a Chaste plugin for the generation and the simulation of Gene Regulatory Networks (GRNs) in multiscale models of multicellular systems. Chaste is a widely used and versatile computational framework for the multiscale modeling and simulation of multicellular biological systems. The plugin, named CoGNaC (Chaste and Gene Networks for Cancer), allows the linking of the regulatory dynamics to key properties of the cell cycle and of the differentiation process in populations of cells, which can subsequently be modeled using different spatial modeling scenarios. The approach of CoGNaC focuses on the emergent dynamical behavior of gene networks, in terms of gene activation patterns characterizing the different cellular phenotypes of real cells and, especially, on the overall robustness to perturbations and biological noise. The integration of this approach within Chaste's modular simulation framework provides a powerful tool to model multicellular systems, possibly allowing for the formulation of novel hypotheses on gene regulation, cell differentiation, and, in particular, cancer emergence and development. In order to demonstrate the usefulness of CoGNaC over a range of modeling paradigms, two example applications are presented. The first of these concerns the characterization of the gene activation patterns of human T-helper cells. The second example is a multiscale simulation of a simplified intestinal crypt, in which, given certain conditions, tumor cells can emerge and colonize the tissue.
