Search | BVS - Ministry of Health

Analysis-ready VCF at Biobank scale using Zarr.

Czech, Eric; Millar, Timothy R; White, Tom; Jeffery, Ben; Miles, Alistair; Tallman, Sam; Wojdyla, Rafal; Zabad, Shadi; Hammerbacher, Jeff; Kelleher, Jerome.

bioRxiv ; 2024 Jun 12.

Article in English | MEDLINE | ID: mdl-38915693

ABSTRACT

Background: Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasises efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is inefficient. Biobank-scale datasets currently available consist of hundreds of thousands of whole genomes and hundreds of terabytes of compressed VCF. Row-wise data storage is fundamentally unsuitable and a more scalable approach is needed. Results: We present the VCF Zarr specification, an encoding of the VCF data model using Zarr which makes retrieving subsets of the data much more efficient. Zarr is a cloud-native format for storing multi-dimensional data, widely used in scientific computing. We show how this format is far more efficient than standard VCF-based approaches, and competitive with specialised methods for storing genotype data in terms of compression ratios and calculation performance. We demonstrate the VCF Zarr format (and the vcf2zarr conversion utility) on a subset of the Genomics England aggV2 dataset comprising 78,195 samples and 59,880,903 variants, with a 5X reduction in storage and greater than 300X reduction in CPU usage in some representative benchmarks. Conclusions: Large row-encoded VCF files are a major bottleneck for current research, and storing and processing these files incurs a substantial cost. The VCF Zarr specification, building on widely used, open-source technologies, has the potential to greatly reduce these costs, and may enable a diverse ecosystem of next-generation tools for analysing genetic variation data directly from cloud-based object stores.
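The contrast between row-wise and field-wise access described in this abstract can be illustrated with a small toy sketch. It uses only the Python standard library and is neither the VCF Zarr specification nor the Zarr API; the record layout, the field names (`pos`, `qual`, `gt`), the JSON serialisation, and the chunk size are all assumptions made for illustration. It counts how many decompressed bytes must be touched to read a single field under each layout.

```python
import json
import zlib

# Hypothetical toy records: position, quality, and genotypes for 4 samples.
records = [
    {"pos": 100 + i, "qual": 30 + (i % 5), "gt": [i % 2, (i + 1) % 2, 0, 1]}
    for i in range(1000)
]

CHUNK = 250  # variants per chunk; an arbitrary illustrative size


def chunks(seq, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


# Row-wise layout: each compressed block holds complete records,
# so all fields are interleaved.
row_store = [zlib.compress(json.dumps(c).encode()) for c in chunks(records, CHUNK)]

# Column-wise layout: each field is chunked and compressed independently.
col_store = {
    field: [
        zlib.compress(json.dumps(c).encode())
        for c in chunks([r[field] for r in records], CHUNK)
    ]
    for field in ("pos", "qual", "gt")
}


def read_pos_rowwise():
    """Read every position; all blocks must be fully decompressed and parsed."""
    out, nbytes = [], 0
    for blob in row_store:
        raw = zlib.decompress(blob)
        nbytes += len(raw)
        out.extend(r["pos"] for r in json.loads(raw))
    return out, nbytes


def read_pos_columnwise():
    """Read every position; only the chunks of the `pos` field are touched."""
    out, nbytes = [], 0
    for blob in col_store["pos"]:
        raw = zlib.decompress(blob)
        nbytes += len(raw)
        out.extend(json.loads(raw))
    return out, nbytes


row_pos, row_bytes = read_pos_rowwise()
col_pos, col_bytes = read_pos_columnwise()
print("row-wise bytes decompressed:   ", row_bytes)
print("column-wise bytes decompressed:", col_bytes)
```

Both functions return the same positions, but the column-wise read decompresses only the bytes belonging to one field. The toy says nothing about the compression ratios or CPU figures reported in the abstract.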

V-primer: software for the efficient design of genome-wide InDel and SNP markers from multi-sample variant call format (VCF) genotyping data.

Natsume, Satoshi; Oikawa, Kaori; Nomura, Chihiro; Ito, Kazue; Utsushi, Hiroe; Shimizu, Motoki; Terauchi, Ryohei; Abe, Akira.

Breed Sci ; 73(4): 415-420, 2023 Sep.

Article in English | MEDLINE | ID: mdl-38106505

ABSTRACT

DNA markers are indispensable tools in genetics and genomics research as well as in crop breeding, particularly for marker-assisted selection. Recent advances in next-generation sequencing technology have made it easier to obtain genome sequences for various crop species, enabling the large-scale identification of DNA polymorphisms among varieties, which in turn has made DNA marker design more accessible. However, existing primer design software is not suitable for designing many types of genome-wide DNA markers from next-generation sequencing data. Here, we describe the development of V-primer, high-throughput software for designing insertion/deletion, cleaved amplified polymorphic sequence, and single-nucleotide polymorphism (SNP) markers. We validated the applicability of these markers in different crops. In addition, we performed multiplex PCR targeted amplicon sequencing using SNP markers designed with V-primer. Our results demonstrate that V-primer facilitates the efficient and accurate design of primers and is thus a useful tool for genetics, genomics, and crop breeding. V-primer is freely available at https://github.com/ncod3/vprimer.

Secure Comparisons of Single Nucleotide Polymorphisms Using Secure Multiparty Computation: Method Development.

Woods, Andrew; Kramer, Skyler T; Xu, Dong; Jiang, Wei.

JMIR Bioinform Biotechnol ; 4: e44700, 2023 Jul 18.

Article in English | MEDLINE | ID: mdl-38935952

ABSTRACT

BACKGROUND: While genomic variations can provide valuable information for health care and ancestry, the privacy of individual genomic data must be protected. Thus, a secure environment is desirable for a human DNA database such that the total data are queryable but not directly accessible to involved parties (eg, data hosts and hospitals) and that the query results are learned only by the user or authorized party. OBJECTIVE: In this study, we provide efficient and secure computations on panels of single nucleotide polymorphisms (SNPs) from genomic sequences as computed under the following set operations: union, intersection, set difference, and symmetric difference. METHODS: Using these operations, we can compute similarity metrics, such as the Jaccard similarity, which could allow querying a DNA database to find the same person and genetic relatives securely. We analyzed various security paradigms and show metrics for the protocols under several security assumptions, such as semihonest, malicious with honest majority, and malicious with a malicious majority. RESULTS: We show that our methods can be used practically on realistically sized data. Specifically, we can compute the Jaccard similarity of two genomes when considering sets of SNPs, each with 400,000 SNPs, in 2.16 seconds with the assumption of a malicious adversary in an honest majority and 0.36 seconds under a semihonest model. CONCLUSIONS: Our methods may help adopt trusted environments for hosting individual genomic data with end-to-end data security.
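For orientation, the set operations and the Jaccard similarity named in this abstract can be written out in plaintext as below. This is only a sketch of the quantities being computed, not the authors' secure multiparty protocol: nothing here is encrypted or secret-shared. The SNP identifiers (chromosome, position, allele tuples) are hypothetical, and returning 1.0 for two empty sets is an assumed convention.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(union)


# Hypothetical SNP panels, each SNP encoded as (chromosome, position, allele).
genome_a = {("chr1", 1001, "A"), ("chr1", 2050, "T"), ("chr2", 300, "G"), ("chr3", 77, "C")}
genome_b = {("chr2", 300, "G"), ("chr3", 77, "C"), ("chr5", 9000, "T")}

# The four set operations listed in the abstract.
union = genome_a | genome_b
intersection = genome_a & genome_b
difference = genome_a - genome_b
symmetric_difference = genome_a ^ genome_b

similarity = jaccard(genome_a, genome_b)
print(len(intersection), len(union), similarity)
```

Here the two panels share 2 SNPs out of 5 distinct SNPs overall, so the similarity is 2/5. In the setting the abstract describes, the same quantity would be computed without either party revealing its set to the other.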
