Search | BVS Regional Portal

A high-performance computational workflow to accelerate GATK SNP detection across a 25-genome dataset.

Zhou, Yong; Kathiresan, Nagarajan; Yu, Zhichao; Rivera, Luis F; Yang, Yujian; Thimma, Manjula; Manickam, Keerthana; Chebotarov, Dmytro; Mauleon, Ramil; Chougule, Kapeel; Wei, Sharon; Gao, Tingting; Green, Carl D; Zuccolo, Andrea; Xie, Weibo; Ware, Doreen; Zhang, Jianwei; McNally, Kenneth L; Wing, Rod A.

BMC Biol ; 22(1): 13, 2024 Jan 25.

Article in English | MEDLINE | ID: mdl-38273258

ABSTRACT

BACKGROUND: Single-nucleotide polymorphisms (SNPs) are the most widely used form of molecular genetic variation studies. As reference genomes and resequencing data sets expand exponentially, tools must be in place to call SNPs at a similar pace. The genome analysis toolkit (GATK) is one of the most widely used SNP calling software tools publicly available, but unfortunately, high-performance computing versions of this tool have yet to become widely available and affordable.

RESULTS: Here we report an open-source high-performance computing genome variant calling workflow (HPC-GVCW) for GATK that can run on multiple computing platforms from supercomputers to desktop machines. We benchmarked HPC-GVCW on multiple crop species for performance and accuracy with comparable results with previously published reports (using GATK alone). Finally, we used HPC-GVCW in production mode to call SNPs on a "subpopulation aware" 16-genome rice reference panel with ~3000 resequenced rice accessions. The entire process took ~16 weeks and resulted in the identification of an average of 27.3 M SNPs/genome and the discovery of ~2.3 million novel SNPs that were not present in the flagship reference genome for rice (i.e., IRGSP RefSeq).

CONCLUSIONS: This study developed an open-source pipeline (HPC-GVCW) to run GATK on HPC platforms, which significantly improved the speed at which SNPs can be called. The workflow is widely applicable as demonstrated successfully for four major crop species with genomes ranging in size from 400 Mb to 2.4 Gb. Using HPC-GVCW in production mode to call SNPs on a 25 multi-crop-reference genome data set produced over 1.1 billion SNPs that were publicly released for functional and breeding studies. For rice, many novel SNPs were identified and were found to reside within genes and open chromatin regions that are predicted to have functional consequences. Combined, our results demonstrate the usefulness of combining a high-performance SNP calling architecture solution with a subpopulation-aware reference genome panel for rapid SNP discovery and public deployment.

Subjects

Plant Genome, Single-Nucleotide Polymorphism, Workflow, Plant Breeding, Software, High-Throughput Nucleotide Sequencing/methods

Pan-genome inversion index reveals evolutionary insights into the subpopulation structure of Asian rice.

Zhou, Yong; Yu, Zhichao; Chebotarov, Dmytro; Chougule, Kapeel; Lu, Zhenyuan; Rivera, Luis F; Kathiresan, Nagarajan; Al-Bader, Noor; Mohammed, Nahed; Alsantely, Aseel; Mussurova, Saule; Santos, João; Thimma, Manjula; Troukhan, Maxim; Fornasiero, Alice; Green, Carl D; Copetti, Dario; Kudrna, David; Llaca, Victor; Lorieux, Mathias; Zuccolo, Andrea; Ware, Doreen; McNally, Kenneth; Zhang, Jianwei; Wing, Rod A.

Nat Commun ; 14(1): 1567, 2023 Mar 21.

Article in English | MEDLINE | ID: mdl-36944612

ABSTRACT

Understanding and exploiting genetic diversity is a key factor for the productive and stable production of rice. Here, we utilize 73 high-quality genomes that encompass the subpopulation structure of Asian rice (Oryza sativa), plus the genomes of two wild relatives (O. rufipogon and O. punctata), to build a pan-genome inversion index of 1769 non-redundant inversions that span an average of ~29% of the O. sativa cv. Nipponbare reference genome sequence. Using this index, we estimate an inversion rate of ~700 inversions per million years in Asian rice, which is 16 to 50 times higher than previously estimated for plants. Detailed analyses of these inversions show evidence of their effects on gene expression, recombination rate, and linkage disequilibrium. Our study uncovers the prevalence and scale of large inversions (≥100 bp) across the pan-genome of Asian rice and hints at their largely unexplored role in functional biology and crop performance.

Subjects

Oryza, Oryza/genetics, DNA Sequence Analysis, Plant Genome/genetics, Biological Evolution, Phylogeny
